The CAZyme classifiers dbCAN, CUPP and eCAMI were independently evaluated against a high-quality benchmark test set. Performance was evaluated for CAZyme/non-CAZyme differentiation and for the multilabel classification of CAZy family annotations. This notebook contains the statistical evaluation of the CAZyme classifiers.
Results summary:
- dbCAN and DIAMOND showed the strongest performances in CAZyme/non-CAZyme differentiation
- dbCAN was the strongest performing tool across all categories; Hotpep (a tool invoked by dbCAN) was the weakest
- The performances of CUPP and eCAMI were similar, although CUPP showed a marginally better performance in the multilabel classification of CAZy family annotations
- The performance of dbCAN may be optimised by substituting Hotpep with CUPP and/or eCAMI
The CAZyme classifiers dbCAN (Zhang et al. 2018), CUPP (Barrett and Lange, 2019) and eCAMI (Xu et al. 2019) use different methods to predict whether a protein is a CAZyme or non-CAZyme, and to predict the CAZy family annotations of predicted CAZymes. These classifiers have not previously been independently evaluated against a high-quality benchmark test set.
This notebook lays out the independent evaluation of dbCAN, CUPP and eCAMI against a high-quality benchmark test set. The tools were evaluated on their ability to differentiate between CAZymes and non-CAZymes, and on their performance in predicting the CAZy family annotations of predicted CAZymes.
dbCAN incorporates the three protein function classifiers HMMER (Potter et al. 2018), Hotpep (Busk et al. 2017), and DIAMOND (Buchfink et al. 2015). In order to comprehensively evaluate the performance of dbCAN, the predictions from HMMER, Hotpep and DIAMOND were evaluated independently of each other, and the consensus prediction (a prediction on which at least two of the tools agree) was defined as the dbCAN result.
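The consensus rule described above can be illustrated with a minimal Python sketch. It assumes that agreement is counted per CAZy family and that each tool's prediction for a protein is available as a set of family names; the function name `consensus_families` and the example predictions are hypothetical and are not taken from dbCAN or from this notebook.

```python
from collections import Counter


def consensus_families(hmmer, hotpep, diamond, min_votes=2):
    """Return the CAZy families predicted by at least `min_votes` tools.

    Each argument is a collection of family names predicted for one
    protein. Counting agreement per family is an assumption of this sketch.
    """
    votes = Counter()
    for prediction in (hmmer, hotpep, diamond):
        votes.update(set(prediction))  # each tool votes once per family
    return {family for family, n in votes.items() if n >= min_votes}


# Hypothetical predictions for a single protein
hmmer = {"GH5", "CBM1"}
hotpep = {"GH5"}
diamond = {"GH5", "CBM1", "GT2"}
consensus = consensus_families(hmmer, hotpep, diamond)
print(sorted(consensus))
```

Under this rule a protein whose consensus set is empty would be treated as a non-CAZyme by dbCAN, even if a single tool assigned it a family.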
A single test set of 100 CAZymes and the 100 non-CAZymes with the highest sequence similarity (rated by bit-score ratio) was created for each genomic assembly selected for inclusion in the benchmark test set. The 100 non-CAZymes with the highest sequence similarity were chosen to increase the probability of causing confusion, and so to give a better idea of the performance to be expected when using the classifiers. An equal number of CAZymes and non-CAZymes was selected to prevent over-representation of one population over the other.
For a genomic assembly to be included in the creation of a test set, the assembly had to meet all of the following criteria:
The genomic assemblies were also chosen from a range of taxonomies to provide an informative picture of the performance of the classifiers over a range of datasets that users may wish to analyse.
Table ?? contains the genomic assemblies used to create the test sets for the evaluation. In total 81 assemblies were chosen: 1 from an Oomycete species (more Oomycete species with greater than 100 CAZymes in CAZy could not be found), 25 from fungal Ascomycete species, 13 from yeasts, 2 from eukaryotic microorganisms, 20 from Gram-positive bacteria, and 20 from Gram-negative bacteria. Figure 2.1 presents the distribution of CAZome coverage across all 70 genomes.
## [1] "Mean percentage of genome incorporated in the CAZome across all test sets:"
## [1] 3.140472
## [1] "Standard deviation of the percentage of genome incorporated in the CAZome across all test sets:"
## [1] 1.174488
## [1] "Mean percentage of CAZomes incorporated in the test set across all genomes:"
## [1] 64.37203
## [1] "Standard deviation of the percentage of CAZome incorporated in the test set across all genomes:"
## [1] 25.54491
Figure 2.1: Histogram of CAZome coverage of the test sets for each respective source genomic assembly, overlaid by a box and whisker plot of the percentage of the CAZome incorporated in the test set.
The assignment of CAZy family annotations by a CAZyme classifier identifies the protein as a CAZyme. If no CAZy family annotations are assigned to a protein by a CAZyme classifier, the tool has identified the protein as a non-CAZyme. This notebook evaluates the performance of the CAZyme classifiers dbCAN (which incorporates HMMER, Hotpep and DIAMOND), CUPP and eCAMI for this binary CAZyme/non-CAZyme classification.
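As a minimal sketch of this binary rule, the Python snippet below converts family annotations into CAZyme/non-CAZyme calls and tallies them against known labels. The function name `confusion_counts`, the dictionary layout and the example proteins are assumptions made for illustration; they do not reproduce the notebook's own code.

```python
def confusion_counts(known, predicted):
    """Tally TP, TN, FP and FN for CAZyme/non-CAZyme calls.

    known: dict mapping protein ID -> True if the protein is a known CAZyme.
    predicted: dict mapping protein ID -> list of predicted CAZy families.
    A protein with at least one predicted family is called a CAZyme.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for protein, is_cazyme in known.items():
        called_cazyme = len(predicted.get(protein, [])) > 0
        if is_cazyme and called_cazyme:
            counts["TP"] += 1
        elif is_cazyme:
            counts["FN"] += 1
        elif called_cazyme:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


# Hypothetical test set of four proteins
known = {"p1": True, "p2": True, "p3": False, "p4": False}
predicted = {"p1": ["GH5"], "p2": [], "p3": ["GT2"], "p4": []}
counts = confusion_counts(known, predicted)
print(counts)
```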
For every classifier-test set pair, the specificity, sensitivity, precision, F1-score and accuracy were calculated.
The mean of each statistical parameter was calculated for each classifier across all tests, to represent the overall performance of each CAZyme classifier.
These results are presented in table 3.1.
| Classifier | Spec Mean | Spec Standard Deviation | Spec Lower CI | Spec Upper CI | Sens Mean | Sens Standard Deviation | Sens Lower CI | Sens Upper CI | Prec Mean | Prec Standard Deviation | Prec Lower CI | Prec Upper CI | F1-score Mean | F1-score Standard Deviation | F1-score Lower CI | F1-score Upper CI | Acc Mean | Acc Standard Deviation | Acc Lower CI | Acc Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9917 | 0.0156 | 0.9880 | 0.9954 | 0.8570 | 0.0825 | 0.8373 | 0.8767 | 0.9908 | 0.0172 | 0.9866 | 0.9949 | 0.9167 | 0.0531 | 0.9040 | 0.9293 | 0.9244 | 0.0417 | 0.9144 | 0.9343 |
| dbCAN | 0.9869 | 0.0245 | 0.9810 | 0.9927 | 0.9087 | 0.1123 | 0.8819 | 0.9355 | 0.9866 | 0.0241 | 0.9808 | 0.9923 | 0.9412 | 0.0796 | 0.9222 | 0.9602 | 0.9478 | 0.0564 | 0.9343 | 0.9612 |
| DIAMOND | 0.9844 | 0.0263 | 0.9782 | 0.9907 | 0.9261 | 0.1298 | 0.8952 | 0.9571 | 0.9847 | 0.0251 | 0.9787 | 0.9907 | 0.9481 | 0.0907 | 0.9264 | 0.9697 | 0.9553 | 0.0641 | 0.9400 | 0.9706 |
| eCAMI | 0.9836 | 0.0257 | 0.9774 | 0.9897 | 0.8610 | 0.1328 | 0.8293 | 0.8927 | 0.9826 | 0.0254 | 0.9765 | 0.9887 | 0.9112 | 0.0868 | 0.8905 | 0.9319 | 0.9223 | 0.0647 | 0.9069 | 0.9377 |
| HMMER | 0.9901 | 0.0163 | 0.9863 | 0.9940 | 0.8831 | 0.0835 | 0.8632 | 0.9030 | 0.9893 | 0.0174 | 0.9851 | 0.9935 | 0.9305 | 0.0613 | 0.9159 | 0.9451 | 0.9366 | 0.0422 | 0.9266 | 0.9467 |
| Hotpep | 0.9840 | 0.0257 | 0.9779 | 0.9901 | 0.8189 | 0.1327 | 0.7872 | 0.8505 | 0.9815 | 0.0287 | 0.9747 | 0.9884 | 0.8862 | 0.0917 | 0.8643 | 0.9081 | 0.9014 | 0.0666 | 0.8855 | 0.9173 |
Owing to the skewing of the data towards 1, the 95% confidence interval (CI) around the mean was calculated and plotted as error bars, illustrated in figure 3.1.
Figure 3.1: Summary statistics of the CAZyme classifiers' performance in binary CAZyme/non-CAZyme prediction, plotting the mean plus and minus the 95% confidence interval.
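The notebook does not state how the confidence interval was computed. The Python sketch below shows one common approach, a normal approximation around the mean of the per-test-set scores, and should be read as an assumption rather than the method used here. The function name and the example scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(scores, level=0.95):
    """Return (mean, lower, upper) using a normal approximation.

    Assumption of this sketch: half-width = z * sample SD / sqrt(n).
    """
    n = len(scores)
    centre = mean(scores)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    half_width = z * stdev(scores) / sqrt(n)
    return centre, centre - half_width, centre + half_width


# Hypothetical specificity scores from six test sets
scores = [0.98, 1.00, 0.97, 0.99, 1.00, 0.96]
m, lower, upper = mean_confidence_interval(scores)
print(round(m, 4), round(lower, 4), round(upper, 4))
```

Because an interval of this kind is symmetric about the mean, its upper bound can exceed 1 when scores cluster near 1, which is consistent with some of the upper CI values reported in the per-class tables.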
Specificity is the proportion of known negatives (known non-CAZymes) which are correctly classified as negatives (non-CAZymes).
Figure 3.2 is a graphical representation of the results calculated in table 3.1.
Figure 3.2: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
Sensitivity (also known as recall) is the proportion of known positives (CAZymes) that are correctly identified as positives (CAZymes).
Figure 3.3 graphically represents the results calculated in table 3.1.
Figure 3.3: One-dimensional scatter plot of recall (sensitivity) scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
Precision is the proportion of positive predictions by the classifiers that are correct.
In this case, precision represents the fraction of CAZyme predictions by the classifiers that are correct, specifically the proportion of predicted CAZymes that are known CAZymes.
Figure 3.4 is a visual representation of the results calculated in table 3.1.
Figure 3.4: One-dimensional scatter plot of precision scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
The F1-score is the harmonic mean of recall and precision and provides an idea of the overall performance of the tool, with 0 being the worst and 1 being the best performance. Figure 3.5 shows the F1-score from each test set, for each classifier.
Figure 3.5: Bar chart of the F1-score of the CAZyme classifiers' differentiation between CAZymes and non-CAZymes.
Accuracy (calculated using (TP + TN) / (TP + TN + FP + FN)) provides an idea of the overall performance of the classifiers as a measure of the degree to which their CAZyme/non-CAZyme predictions conform to the correct result. Figure 3.6 is a plot of the respective data from table 3.1.
Figure 3.6: Bar chart of the accuracy of the CAZyme classifiers' differentiation between CAZymes and non-CAZymes.
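The five statistics defined above can all be derived from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) for one classifier on one test set. The following Python sketch shows the standard formulas; the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five summary statistics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a denominator is zero
        return numerator / denominator if denominator else float("nan")

    return {
        "specificity": ratio(tn, tn + fp),
        "sensitivity": ratio(tp, tp + fn),
        "precision": ratio(tp, tp + fp),
        # Equivalent to the harmonic mean of precision and sensitivity
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Hypothetical counts for a test set of 100 CAZymes and 100 non-CAZymes
metrics = classification_metrics(tp=90, tn=98, fp=2, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```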
The statistics evaluated above provide an idea of the general performance of the tools, but they do not provide an idea of the expected range of performance. Specifically, the data do not provide a clear picture of the best and worst performance a user can expect when using these tools.
To compare the expected typical range in accuracies for each classifier, 6 test sets (identified by the source genomic assemblies) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times each, and the accuracy was calculated for each bootstrap sample. The accuracies of the bootstrap samples for each classifier were plotted on stacked histograms, shown in figure 3.7.
Figure 3.7: Stacked histograms of bootstrap sample accuracies of CAZyme classifiers’ differentiation between CAZymes and non-CAZymes. 6 test sets (identified by their source genomic assembly) were selected at random. The CAZyme/non-CAZyme predictions for each classifier, for each test set, were bootstrap resampled 100 times. The accuracies of the resulting 600 bootstrap samples per classifier were plotted as a stacked histogram.
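A minimal Python sketch of the bootstrap procedure for one classifier and one test set is given below. It assumes that proteins are resampled with replacement and that each bootstrap sample has the same size as the test set; the function name, the seed and the example predictions are hypothetical.

```python
import random


def bootstrap_accuracies(known, predicted, n_resamples=100, seed=0):
    """Resample protein-level predictions with replacement and return
    the accuracy of each bootstrap sample.

    known, predicted: equal-length lists of booleans (True = CAZyme).
    """
    rng = random.Random(seed)
    pairs = list(zip(known, predicted))
    accuracies = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=len(pairs))  # sampling with replacement
        correct = sum(1 for truth, call in sample if truth == call)
        accuracies.append(correct / len(sample))
    return accuracies


# Hypothetical test set: 100 CAZymes (90 found) and 100 non-CAZymes (2 miscalled)
known = [True] * 100 + [False] * 100
predicted = [True] * 90 + [False] * 10 + [False] * 98 + [True] * 2
accuracies = bootstrap_accuracies(known, predicted)
print(min(accuracies), max(accuracies))
```

Repeating this for each of the 6 selected test sets would give 600 accuracies per classifier, which is the quantity shown in the histograms.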
Overall, all tools showed a low probability of producing false positives (misclassifying a non-CAZyme as a CAZyme), and few of the positive predictions are false positives. Therefore, we can be confident that the CAZyme predictions made by each of these tools are most likely correct. However, all the classifiers consistently failed to identify all CAZymes within a CAZome. Therefore, we can be confident in the CAZyme predictions, but should not presume that all non-CAZyme predictions are correct; these classifiers are unlikely to identify the complete CAZome, although a near-complete CAZome will be accurately identified.
dbCAN consistently demonstrated the strongest performance in all categories, implying that eCAMI and CUPP are not suitable replacements for this CAZyme classifier. Hotpep, which is incorporated within dbCAN, consistently demonstrated the weakest performance. Therefore, substituting eCAMI and/or CUPP for Hotpep within dbCAN may further improve the performance of dbCAN. The new k-mer based methods, eCAMI and CUPP, demonstrated similar performances. CUPP showed a more consistent performance, whereas eCAMI demonstrated a greater range in performance, although its mean performance was fractionally greater than that of CUPP. However, more bootstrap-calculated accuracy scores fell within the range of 0.9-1.0 for CUPP than for eCAMI. This implies that CUPP may typically provide a better performance than eCAMI, although eCAMI does have the potential on some occasions to outperform CUPP, depending on the test set.
CAZy groups CAZymes into CAZy families by sequence similarity, and CAZy families are grouped into one of 6 functional classes. The CAZyme classifiers predict the CAZy family annotations of predicted CAZymes, but it is also of interest to assess the performance of the classifiers at the CAZy class level. Specifically, a classifier may struggle to predict the correct CAZy family for a CAZyme but consistently predict the correct CAZy class. Therefore, the aim of this part of the evaluation is to assess the performance of the classifiers in predicting the correct CAZy class of predicted CAZymes.
Below is a table summarising all statistical parameters calculated to evaluate the performance of CAZy class classification for each prediction tool across all CAZy classes.
| Classifier | Spec Mean | Spec Standard Deviation | Spec Lower CI | Spec Upper CI | Sens Mean | Sens Standard Deviation | Sens Lower CI | Sens Upper CI | Prec Mean | Prec Standard Deviation | Prec Lower CI | Prec Upper CI | F1-score Mean | F1-score Standard Deviation | F1-score Lower CI | F1-score Upper CI | Acc Mean | Acc Standard Deviation | Acc Lower CI | Acc Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dbCAN | 0.9962 | 0.0121 | 0.9950 | 0.9975 | 0.8928 | 0.1721 | 0.8748 | 0.9109 | 0.9626 | 0.1298 | 0.9490 | 0.9762 | 0.9170 | 0.1460 | 0.9017 | 0.9323 | 0.9755 | 0.0429 | 0.9710 | 0.9800 |
| HMMER | 0.9966 | 0.0103 | 0.9955 | 0.9977 | 0.8270 | 0.2407 | 0.8017 | 0.8522 | 0.9612 | 0.1388 | 0.9466 | 0.9757 | 0.8675 | 0.2013 | 0.8464 | 0.8886 | 0.9686 | 0.0392 | 0.9645 | 0.9727 |
| Hotpep | 0.9749 | 0.0471 | 0.9700 | 0.9799 | 0.8317 | 0.2120 | 0.8095 | 0.8540 | 0.8576 | 0.2495 | 0.8314 | 0.8837 | 0.8207 | 0.2116 | 0.7985 | 0.8429 | 0.9421 | 0.0673 | 0.9350 | 0.9491 |
| DIAMOND | 0.9956 | 0.0130 | 0.9942 | 0.9969 | 0.9078 | 0.1960 | 0.8872 | 0.9283 | 0.9578 | 0.1526 | 0.9418 | 0.9738 | 0.9213 | 0.1725 | 0.9032 | 0.9394 | 0.9816 | 0.0426 | 0.9771 | 0.9861 |
| CUPP | 0.9975 | 0.0097 | 0.9964 | 0.9985 | 0.7118 | 0.3888 | 0.6711 | 0.7526 | 0.7695 | 0.4098 | 0.7265 | 0.8124 | 0.7343 | 0.3937 | 0.6930 | 0.7756 | 0.9554 | 0.0635 | 0.9487 | 0.9620 |
| eCAMI | 0.9852 | 0.0324 | 0.9818 | 0.9886 | 0.8362 | 0.2157 | 0.8137 | 0.8588 | 0.8966 | 0.2066 | 0.8749 | 0.9182 | 0.8487 | 0.1950 | 0.8282 | 0.8691 | 0.9590 | 0.0536 | 0.9534 | 0.9646 |
Figure 4.1: Summary statistics of CAZyme classifiers performances of CAZy class classification, plotting the mean plus and minus the 95% confidence interval.
Below, a proportional area plot representing the F1-score for each CAZyme classifier for each test set is generated. Each square is sized in proportion to the relative sample size. Not every class was included in every sample, resulting in sample sizes that differ between CAZy classes but are the same between classifiers.
Figure 4.2: Proportional area plot of CAZy class classification performance. Performance is represented by the F1-score. Plots are proportional to the number of test sets, after excluding true negative classifications.
A dataframe of the number of test sets containing each CAZy class is generated (table ??).
## Prediction_tool GH GT PL CE AA CBM
## 1 dbCAN 70 70 38 67 37 70
## 2 HMMER 70 70 38 67 37 70
## 3 DIAMOND 70 70 38 67 37 70
## 4 Hotpep 70 70 38 67 37 70
## 5 CUPP 70 70 38 67 37 70
## 6 eCAMI 70 70 39 67 37 70
The sensitivity of each CAZyme classifier can be plotted against the specificity for each CAZy class. However, plotting all CAZy classes in a single plot produces a cramped plot unless very few test sets are used.
Figure 4.3: Scatter plot of sensitivity against specificity for predicting CAZy class members per CAZyme classifier.
Below the prediction sensitivity is plotted against the specificity for each classifier, and a separate plot is generated for each CAZy class.
The scatter plots of sensitivity against specificity overlay a coloured contour to highlight the distribution of the points. When too many points have the same value, a contour cannot be generated. In order to plot a contour, noise is added to the data. The original data are used to plot the scatter plot and the data with added noise are used to plot the contour.
The percentage of the data points to which noise must be added in order to generate a contour varies from data set to data set. To change the percentage of the data points that have noise added to them, change the third argument in the call to the function plot.class.sens.vs.spec(), which is used to generate the plots. The third argument is the proportion of data points to add noise to, written in decimal form.
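The idea of adding noise to a chosen fraction of the points can be sketched in Python as follows. This does not reproduce plot.class.sens.vs.spec(); the function name `add_noise_to_fraction`, the uniform noise and the `scale` parameter are assumptions made purely for illustration.

```python
import random


def add_noise_to_fraction(values, fraction, scale=0.005, seed=0):
    """Return a copy of `values` with small uniform noise added to a
    randomly chosen `fraction` of the points (fraction in decimal form).

    The noisy copy is intended only for drawing the contour; the
    original values should still be used for the scatter plot.
    """
    rng = random.Random(seed)
    n_noisy = round(len(values) * fraction)
    noisy_indices = set(rng.sample(range(len(values)), n_noisy))
    return [
        value + rng.uniform(-scale, scale) if index in noisy_indices else value
        for index, value in enumerate(values)
    ]


# Hypothetical specificity scores that are all tied at 1.0
specificity = [1.0] * 10
for_contour = add_noise_to_fraction(specificity, 0.3)
changed = sum(1 for a, b in zip(specificity, for_contour) if a != b)
print(changed, "of", len(specificity), "points were perturbed")
```

A larger fraction breaks more ties and makes a contour easier to fit, at the cost of the contour departing further from the original data.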
Figure 4.4: Scatter plot of sensitivity against specificity for predicting GH CAZy class members per CAZyme classifier, overlaying a density map.
| Prediction_tool | Spec Mean | Spec Standard Deviation | Spec CI Lower | Spec CI Upper | Sens Mean | Sens Standard Deviation | Sens CI Lower | Sens CI Upper | Prec Mean | Prec Standard Deviation | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score Standard Deviation | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc Standard Deviation | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9955 | 0.0119 | 0.9927 | 0.9983 | 0.9136 | 0.0685 | 0.8973 | 0.9300 | 0.9957 | 0.0112 | 0.9930 | 0.9983 | 0.9514 | 0.0402 | 0.9418 | 0.9609 | 0.9581 | 0.0279 | 0.9514 | 0.9647 |
| dbCAN | 0.9960 | 0.0110 | 0.9934 | 0.9986 | 0.9209 | 0.1017 | 0.8966 | 0.9451 | 0.9947 | 0.0182 | 0.9903 | 0.9990 | 0.9527 | 0.0661 | 0.9370 | 0.9685 | 0.9625 | 0.0382 | 0.9534 | 0.9716 |
| DIAMOND | 0.9934 | 0.0141 | 0.9900 | 0.9967 | 0.9447 | 0.1054 | 0.9196 | 0.9699 | 0.9921 | 0.0224 | 0.9868 | 0.9975 | 0.9639 | 0.0689 | 0.9475 | 0.9803 | 0.9715 | 0.0413 | 0.9617 | 0.9814 |
| eCAMI | 0.9886 | 0.0205 | 0.9837 | 0.9935 | 0.8834 | 0.1104 | 0.8570 | 0.9097 | 0.9887 | 0.0214 | 0.9836 | 0.9938 | 0.9286 | 0.0660 | 0.9129 | 0.9444 | 0.9423 | 0.0446 | 0.9317 | 0.9529 |
| HMMER | 0.9957 | 0.0108 | 0.9931 | 0.9982 | 0.9151 | 0.0834 | 0.8952 | 0.9350 | 0.9944 | 0.0145 | 0.9909 | 0.9978 | 0.9506 | 0.0583 | 0.9367 | 0.9645 | 0.9585 | 0.0342 | 0.9503 | 0.9667 |
| Hotpep | 0.9853 | 0.0234 | 0.9797 | 0.9909 | 0.8825 | 0.1106 | 0.8562 | 0.9089 | 0.9842 | 0.0294 | 0.9772 | 0.9912 | 0.9263 | 0.0685 | 0.9100 | 0.9426 | 0.9403 | 0.0424 | 0.9302 | 0.9504 |
Figure 4.5: Summary statistics of CAZyme classifiers performances of GH class classification, plotting the mean plus and minus the 95% confidence interval.
Figure 4.6: One-dimensional scatter plot of the specificity per test set for the classification of GH class members, overlaying a box plot.
Figure 4.7: One-dimensional scatter plot of the sensitivity per test set for the classification of GH class members, overlaying a box plot.
Figure 4.8: One-dimensional scatter plot of the precision per test set for the classification of GH class members, overlaying a box plot.
Figure 4.9: One-dimensional scatter plot of the F1-score per test set for the classification of GH class members, overlaying a box plot.
Figure 4.10: One-dimensional scatter plot of the accuracy per test set for the classification of GH class members, overlaying a box plot.
Figure 4.11: Scatter plot of sensitivity against specificity for predicting GT CAZy class members per CAZyme classifier, overlaying a density map.
| Prediction_tool | Spec Mean | Spec Standard Deviation | Spec CI Lower | Spec CI Upper | Sens Mean | Sens Standard Deviation | Sens CI Lower | Sens CI Upper | Prec Mean | Prec Standard Deviation | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score Standard Deviation | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc Standard Deviation | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9981 | 0.0078 | 0.9962 | 1.0000 | 0.8657 | 0.1188 | 0.8374 | 0.8940 | 0.9971 | 0.0115 | 0.9944 | 0.9999 | 0.9220 | 0.0759 | 0.9039 | 0.9401 | 0.9493 | 0.0581 | 0.9354 | 0.9632 |
| dbCAN | 0.9990 | 0.0065 | 0.9975 | 1.0006 | 0.8827 | 0.1393 | 0.8495 | 0.9159 | 0.9988 | 0.0080 | 0.9969 | 1.0007 | 0.9300 | 0.0983 | 0.9065 | 0.9534 | 0.9549 | 0.0727 | 0.9376 | 0.9722 |
| DIAMOND | 0.9977 | 0.0086 | 0.9956 | 0.9997 | 0.9314 | 0.1483 | 0.8961 | 0.9668 | 0.9968 | 0.0120 | 0.9940 | 0.9997 | 0.9550 | 0.1052 | 0.9299 | 0.9800 | 0.9702 | 0.0768 | 0.9519 | 0.9885 |
| eCAMI | 0.9980 | 0.0090 | 0.9958 | 1.0002 | 0.8529 | 0.1627 | 0.8141 | 0.8917 | 0.9978 | 0.0098 | 0.9954 | 1.0001 | 0.9101 | 0.1109 | 0.8837 | 0.9366 | 0.9417 | 0.0800 | 0.9226 | 0.9608 |
| HMMER | 0.9979 | 0.0096 | 0.9956 | 1.0002 | 0.8747 | 0.1080 | 0.8489 | 0.9005 | 0.9980 | 0.0092 | 0.9958 | 1.0002 | 0.9279 | 0.0768 | 0.9095 | 0.9462 | 0.9544 | 0.0532 | 0.9417 | 0.9671 |
| Hotpep | 0.9984 | 0.0070 | 0.9967 | 1.0001 | 0.7253 | 0.1889 | 0.6802 | 0.7703 | 0.9966 | 0.0132 | 0.9934 | 0.9997 | 0.8242 | 0.1433 | 0.7900 | 0.8584 | 0.8996 | 0.0899 | 0.8782 | 0.9210 |
Figure 4.12: Summary statistics of CAZyme classifiers performances of GT class classification, plotting the mean plus and minus the 95% confidence interval.
Figure 4.13: One-dimensional scatter plot of the specificity per test set for the classification of GT class members, overlaying a box plot.
Figure 4.14: One-dimensional scatter plot of the sensitivity per test set for the classification of GT class members, overlaying a box plot.
Figure 4.15: One-dimensional scatter plot of the precision per test set for the classification of GT class members, overlaying a box plot.
Figure 4.16: One-dimensional scatter plot of the F1-score per test set for the classification of GT class members, overlaying a box plot.
Figure 4.17: One-dimensional scatter plot of the accuracy per test set for the classification of GT class members, overlaying a box plot.
Figure 4.18: Scatter plot of sensitivity against specificity for predicting PL CAZy class members per CAZyme classifier, overlaying a density map.
| Prediction_tool | Spec Mean | Spec Standard Deviation | Spec CI Lower | Spec CI Upper | Sens Mean | Sens Standard Deviation | Sens CI Lower | Sens CI Upper | Prec Mean | Prec Standard Deviation | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score Standard Deviation | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc Standard Deviation | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9992 | 0.0028 | 0.9983 | 1.0001 | 0.7751 | 0.3573 | 0.6576 | 0.8925 | 0.8493 | 0.3421 | 0.7369 | 0.9618 | 0.7957 | 0.3402 | 0.6839 | 0.9075 | 0.9919 | 0.0141 | 0.9872 | 0.9965 |
| dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.8600 | 0.2674 | 0.7721 | 0.9479 | 0.9474 | 0.2263 | 0.8730 | 1.0217 | 0.8911 | 0.2451 | 0.8105 | 0.9716 | 0.9950 | 0.0083 | 0.9923 | 0.9978 |
| DIAMOND | 0.9995 | 0.0023 | 0.9987 | 1.0002 | 0.8838 | 0.2641 | 0.7970 | 0.9706 | 0.9305 | 0.2375 | 0.8524 | 1.0085 | 0.8948 | 0.2479 | 0.8133 | 0.9763 | 0.9958 | 0.0072 | 0.9935 | 0.9982 |
| eCAMI | 0.9992 | 0.0028 | 0.9983 | 1.0001 | 0.7547 | 0.3215 | 0.6505 | 0.8589 | 0.8880 | 0.3069 | 0.7886 | 0.9875 | 0.8035 | 0.3049 | 0.7047 | 0.9023 | 0.9901 | 0.0154 | 0.9851 | 0.9951 |
| HMMER | 0.9995 | 0.0033 | 0.9984 | 1.0006 | 0.8884 | 0.2465 | 0.8074 | 0.9694 | 0.9342 | 0.2374 | 0.8562 | 1.0122 | 0.9061 | 0.2372 | 0.8281 | 0.9840 | 0.9955 | 0.0083 | 0.9928 | 0.9983 |
| Hotpep | 0.9985 | 0.0053 | 0.9968 | 1.0003 | 0.8213 | 0.2927 | 0.7251 | 0.9175 | 0.9089 | 0.2738 | 0.8189 | 0.9989 | 0.8534 | 0.2736 | 0.7635 | 0.9434 | 0.9917 | 0.0155 | 0.9866 | 0.9967 |
Figure 4.19: Summary statistics of CAZyme classifiers performances of PL class classification, plotting the mean plus and minus the 95% confidence interval.
Figure 4.20: One-dimensional scatter plot of the specificity per test set for the classification of PL class members, overlaying a box plot.
Figure 4.21: One-dimensional scatter plot of the sensitivity per test set for the classification of PL class members, overlaying a box plot.
Figure 4.22: One-dimensional scatter plot of the precision per test set for the classification of PL class members, overlaying a box plot.
Figure 4.23: One-dimensional scatter plot of the F1-score per test set for the classification of PL class members, overlaying a box plot.
Figure 4.24: One-dimensional scatter plot of the accuracy per test set for the classification of PL class members, overlaying a box plot.
Figure 4.25: Scatter plot of sensitivity against specificity for predicting CE CAZy class members per CAZyme classifier, overlaying a density map.
| Prediction_tool | Spec Mean | Spec Standard Deviation | Spec CI Lower | Spec CI Upper | Sens Mean | Sens Standard Deviation | Sens CI Lower | Sens CI Upper | Prec Mean | Prec Standard Deviation | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score Standard Deviation | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc Standard Deviation | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9977 | 0.0085 | 0.9956 | 0.9997 | 0.9352 | 0.1498 | 0.8986 | 0.9717 | 0.9606 | 0.1455 | 0.9252 | 0.9961 | 0.9429 | 0.1383 | 0.9092 | 0.9766 | 0.9946 | 0.0095 | 0.9923 | 0.9969 |
| dbCAN | 0.9959 | 0.0161 | 0.9920 | 0.9998 | 0.9646 | 0.1464 | 0.9289 | 1.0003 | 0.9520 | 0.1664 | 0.9114 | 0.9926 | 0.9510 | 0.1507 | 0.9142 | 0.9877 | 0.9948 | 0.0155 | 0.9910 | 0.9986 |
| DIAMOND | 0.9958 | 0.0167 | 0.9917 | 0.9998 | 0.9174 | 0.2219 | 0.8632 | 0.9715 | 0.9361 | 0.2050 | 0.8861 | 0.9861 | 0.9128 | 0.2107 | 0.8614 | 0.9642 | 0.9925 | 0.0182 | 0.9880 | 0.9969 |
| eCAMI | 0.9941 | 0.0166 | 0.9901 | 0.9982 | 0.8396 | 0.2646 | 0.7751 | 0.9041 | 0.8992 | 0.2344 | 0.8421 | 0.9564 | 0.8490 | 0.2384 | 0.7909 | 0.9072 | 0.9885 | 0.0176 | 0.9842 | 0.9928 |
| HMMER | 0.9977 | 0.0081 | 0.9957 | 0.9996 | 0.9493 | 0.1129 | 0.9217 | 0.9768 | 0.9748 | 0.0772 | 0.9560 | 0.9936 | 0.9554 | 0.0794 | 0.9360 | 0.9748 | 0.9952 | 0.0089 | 0.9930 | 0.9973 |
| Hotpep | 0.9933 | 0.0173 | 0.9891 | 0.9975 | 0.8945 | 0.2320 | 0.8379 | 0.9511 | 0.8950 | 0.2385 | 0.8368 | 0.9532 | 0.8832 | 0.2235 | 0.8286 | 0.9377 | 0.9896 | 0.0176 | 0.9853 | 0.9939 |
Figure 4.26: Summary statistics of CAZyme classifiers performances of CE class classification, plotting the mean plus and minus the 95% confidence interval.
Figure 4.27: One-dimensional scatter plot of the specificity per test set for the classification of CE class members, overlaying a box plot.
Figure 4.28: One-dimensional scatter plot of the sensitivity per test set for the classification of CE class members, overlaying a box plot.
Figure 4.29: One-dimensional scatter plot of the precision per test set for the classification of CE class members, overlaying a box plot.
Figure 4.30: One-dimensional scatter plot of the F1-score per test set for the classification of CE class members, overlaying a box plot.
Figure 4.31: One-dimensional scatter plot of the accuracy per test set for the classification of CE class members, overlaying a box plot.
Figure 4.32: Scatter plot of sensitivity against specificity for predicting AA CAZy class members per CAZyme classifier, overlaying a density map.
| Prediction_tool | Spec Mean | Spec Standard Deviation | Spec CI Lower | Spec CI Upper | Sens Mean | Sens Standard Deviation | Sens CI Lower | Sens CI Upper | Prec Mean | Prec Standard Deviation | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score Standard Deviation | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc Standard Deviation | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9930 | 0.0187 | 0.9868 | 0.9993 | 0.9165 | 0.1213 | 0.8760 | 0.9569 | 0.9383 | 0.1497 | 0.8884 | 0.9882 | 0.9169 | 0.1226 | 0.8760 | 0.9578 | 0.9862 | 0.0248 | 0.9779 | 0.9945 |
| dbCAN | 0.9930 | 0.0196 | 0.9865 | 0.9996 | 0.9372 | 0.1147 | 0.8989 | 0.9754 | 0.9390 | 0.1492 | 0.8892 | 0.9887 | 0.9294 | 0.1241 | 0.8881 | 0.9708 | 0.9886 | 0.0251 | 0.9803 | 0.9970 |
| DIAMOND | 0.9930 | 0.0194 | 0.9866 | 0.9995 | 0.8796 | 0.2475 | 0.7971 | 0.9622 | 0.9143 | 0.2099 | 0.8443 | 0.9843 | 0.8743 | 0.2267 | 0.7987 | 0.9499 | 0.9872 | 0.0225 | 0.9797 | 0.9947 |
| eCAMI | 0.9936 | 0.0169 | 0.9880 | 0.9992 | 0.8422 | 0.1926 | 0.7780 | 0.9064 | 0.9374 | 0.1505 | 0.8872 | 0.9876 | 0.8679 | 0.1556 | 0.8160 | 0.9198 | 0.9818 | 0.0273 | 0.9727 | 0.9909 |
| HMMER | 0.9925 | 0.0187 | 0.9862 | 0.9987 | 0.9671 | 0.0673 | 0.9447 | 0.9896 | 0.9345 | 0.1462 | 0.8857 | 0.9832 | 0.9429 | 0.1066 | 0.9073 | 0.9784 | 0.9891 | 0.0218 | 0.9818 | 0.9963 |
| Hotpep | 0.9928 | 0.0201 | 0.9861 | 0.9995 | 0.9225 | 0.1319 | 0.8785 | 0.9664 | 0.9370 | 0.1536 | 0.8858 | 0.9883 | 0.9190 | 0.1311 | 0.8753 | 0.9627 | 0.9873 | 0.0256 | 0.9788 | 0.9958 |
Figure 4.33: Summary statistics of CAZyme classifiers performances of AA class classification, plotting the mean plus and minus the 95% confidence interval.
Figure 4.34: One-dimensional scatter plot of the specificity per test set for the classification of AA class members, overlaying a box plot.
Figure 4.35: One-dimensional scatter plot of the sensitivity per test set for the classification of AA class members, overlaying a box plot.
Figure 4.36: One-dimensional scatter plot of the precision per test set for the classification of AA class members, overlaying a box plot.
Figure 4.37: One-dimensional scatter plot of the F1-score per test set for the classification of AA class members, overlaying a box plot.
Figure 4.38: One-dimensional scatter plot of the accuracy per test set for the classification of AA class members, overlaying a box plot.
Figure 4.39: Scatter plot of sensitivity against specificity for predicting CBM CAZy class members per CAZyme classifier, overlaying a density map.
| Prediction_tool | Spec Mean | Spec Standard Deviation | Spec CI Lower | Spec CI Upper | Sens Mean | Sens Standard Deviation | Sens CI Lower | Sens CI Upper | Prec Mean | Prec Standard Deviation | Prec CI Lower | Prec CI Upper | F1-score Mean | F1-score Standard Deviation | F1-score CI Lower | F1-score CI Upper | Acc Mean | Acc Standard Deviation | Acc CI Lower | Acc CI Upper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.8852 | 0.0898 | 0.8638 | 0.9066 |
| dbCAN | 0.9937 | 0.0103 | 0.9912 | 0.9962 | 0.8007 | 0.1975 | 0.7536 | 0.8478 | 0.9254 | 0.1243 | 0.8958 | 0.9551 | 0.8433 | 0.1547 | 0.8064 | 0.8802 | 0.9729 | 0.0272 | 0.9664 | 0.9794 |
| DIAMOND | 0.9947 | 0.0101 | 0.9923 | 0.9971 | 0.8659 | 0.2031 | 0.8175 | 0.9143 | 0.9429 | 0.1395 | 0.9097 | 0.9762 | 0.8924 | 0.1669 | 0.8526 | 0.9322 | 0.9820 | 0.0235 | 0.9764 | 0.9876 |
| eCAMI | 0.9482 | 0.0513 | 0.9359 | 0.9604 | 0.8116 | 0.2202 | 0.7591 | 0.8641 | 0.6838 | 0.1874 | 0.6391 | 0.7285 | 0.7218 | 0.1762 | 0.6798 | 0.7638 | 0.9354 | 0.0512 | 0.9232 | 0.9476 |
| HMMER | 0.9960 | 0.0082 | 0.9941 | 0.9980 | 0.4666 | 0.2448 | 0.4082 | 0.5250 | 0.9069 | 0.2101 | 0.8568 | 0.9570 | 0.5792 | 0.2200 | 0.5267 | 0.6316 | 0.9420 | 0.0332 | 0.9341 | 0.9499 |
| Hotpep | 0.9013 | 0.0565 | 0.8878 | 0.9148 | 0.7851 | 0.2226 | 0.7320 | 0.8381 | 0.4862 | 0.1634 | 0.4473 | 0.5252 | 0.5823 | 0.1646 | 0.5430 | 0.6215 | 0.8898 | 0.0560 | 0.8765 | 0.9032 |
Figure 4.40: Summary statistics of CAZyme classifiers performances of CBM class classification, plotting the mean plus and minus the 95% confidence interval.
Figure 4.41: One-dimensional scatter plot of the specificity per test set for the classification of CBM class members, overlaying a box plot.
Figure 4.42: One-dimensional scatter plot of the sensitivity per test set for the classification of CBM class members, overlaying a box plot.
Figure 4.43: One-dimensional scatter plot of the precision per test set for the classification of CBM class members, overlaying a box plot.
Figure 4.44: One-dimensional scatter plot of the F1-score per test set for the classification of CBM class members, overlaying a box plot.
Figure 4.45: One-dimensional scatter plot of the accuracy per test set for the classification of CBM class members, overlaying a box plot.
A single CAZyme can be included in multiple CAZy classes, which makes the prediction of CAZy classes a multilabel classification. To address this and evaluate the multilabel classification of CAZy classes, the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.
The RI is a measure of accuracy across all potential classifications of a protein. The RI ranges from 0 (no correct annotations) to 1 (all annotations correct). The ARI is the RI adjusted for chance, where 0 is equivalent to assigning the CAZy class annotations at random, negative values indicate agreement worse than expected by chance, and 1 indicates that all annotations are correct.
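For illustration, the Python sketch below computes the textbook pair-counting RI and ARI between two labelings of the same items. Applying it per protein to the known and predicted presence/absence of each of the six CAZy classes is an assumption about how the indices might be used here, not a description of the notebook's code; the function name and example vectors are hypothetical.

```python
from collections import Counter
from math import comb


def rand_indices(labels_true, labels_pred):
    """Return (RI, ARI) for two labelings of the same items.

    Requires at least two items, otherwise there are no pairs to compare.
    """
    total_pairs = comb(len(labels_true), 2)
    contingency = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(c, 2) for c in contingency.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Pairs that agree: grouped together in both, or separated in both
    agreements = total_pairs + 2 * same_both - same_true - same_pred
    ri = agreements / total_pairs
    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    # When both labelings are trivial the ARI is undefined; report 1.0
    ari = 1.0 if maximum == expected else (same_both - expected) / (maximum - expected)
    return ri, ari


# Hypothetical protein; columns are GH, GT, PL, CE, AA, CBM (1 = class assigned)
known = [1, 0, 0, 0, 0, 1]       # known annotations: GH and CBM
predicted = [1, 0, 0, 0, 0, 0]   # classifier predicted GH only
ri, ari = rand_indices(known, predicted)
print(round(ri, 4), round(ari, 4))
```

In this example the missed CBM annotation lowers the RI to 10/15, and the ARI is lower still because part of that agreement is expected by chance.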
| Prediction_tool | Lower CI | Mean | Upper CI | Standard Deviation |
|---|---|---|---|---|
| dbCAN | 0.9359 | 0.9398 | 0.9437 | 0.2359 |
| HMMER | 0.9226 | 0.9268 | 0.9310 | 0.2537 |
| DIAMOND | 0.9510 | 0.9545 | 0.9579 | 0.2079 |
| Hotpep | 0.8653 | 0.8706 | 0.8759 | 0.3212 |
| CUPP | 0.8960 | 0.9007 | 0.9054 | 0.2852 |
| eCAMI | 0.9013 | 0.9060 | 0.9107 | 0.2836 |
The plots are violin plots underlying scatter plots, presenting the RI and ARI for every protein across all test sets.
Figure 4.46: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
Figure 4.47: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
Figure 4.48: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
Figure 4.49: 95% confidence interval around the mean of Adjusted Rand Index (ARI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
The following section evaluates the performance of the CAZyme classifiers to predict CAZy family classifications.
Below is a table summarising the overall CAZy family classifications for each test set across all CAZy families.
| Classifier | Spec Mean | Spec Standard Deviation | Spec Lower CI | Spec Upper CI | Sens Mean | Sens Standard Deviation | Sens Lower CI | Sens Upper CI | Prec Mean | Prec Standard Deviation | Prec Lower CI | Prec Upper CI | F1-score Mean | F1-score Standard Deviation | F1-score Lower CI | F1-score Upper CI | Acc Mean | Acc Standard Deviation | Acc Lower CI | Acc Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dbCAN | 0.9999 | 3e-04 | 0.9999 | 1.0000 | 0.8874 | 0.2417 | 0.8597 | 0.9150 | 0.9309 | 0.2275 | 0.9048 | 0.9569 | 0.8997 | 0.2349 | 0.8728 | 0.9265 | 0.9995 | 0.0014 | 0.9994 | 0.9997 |
| HMMER | 0.9999 | 3e-04 | 0.9999 | 0.9999 | 0.8703 | 0.2814 | 0.8383 | 0.9022 | 0.8861 | 0.2791 | 0.8545 | 0.9178 | 0.8640 | 0.2781 | 0.8325 | 0.8956 | 0.9994 | 0.0022 | 0.9991 | 0.9996 |
| Hotpep | 0.9994 | 2e-03 | 0.9991 | 0.9996 | 0.7621 | 0.3347 | 0.7248 | 0.7993 | 0.7661 | 0.3771 | 0.7241 | 0.8081 | 0.7305 | 0.3504 | 0.6915 | 0.7695 | 0.9987 | 0.0034 | 0.9983 | 0.9991 |
| DIAMOND | 0.9999 | 3e-04 | 0.9999 | 1.0000 | 0.8927 | 0.2386 | 0.8654 | 0.9200 | 0.9268 | 0.2257 | 0.9010 | 0.9527 | 0.9025 | 0.2323 | 0.8760 | 0.9291 | 0.9997 | 0.0008 | 0.9996 | 0.9997 |
| CUPP | 1.0000 | 2e-04 | 0.9999 | 1.0000 | 0.6582 | 0.4360 | 0.6084 | 0.7081 | 0.7048 | 0.4458 | 0.6538 | 0.7558 | 0.6723 | 0.4354 | 0.6225 | 0.7221 | 0.9992 | 0.0023 | 0.9989 | 0.9994 |
| eCAMI | 0.9997 | 9e-04 | 0.9996 | 0.9998 | 0.7356 | 0.3412 | 0.6972 | 0.7739 | 0.7791 | 0.3671 | 0.7378 | 0.8203 | 0.7372 | 0.3437 | 0.6986 | 0.7758 | 0.9992 | 0.0016 | 0.9990 | 0.9994 |
To evaluate the overall performance of each classifier for each CAZy family, the F1-score was calculated for every family. Families were grouped by their parent CAZy class and the distribution of the F1-scores is shown in figure 5.1.
Figure 5.1: Proportional area plot of the F1-score distribution per CAZy family, grouped by CAZy class.
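As a rough illustration of the per-family F1-score calculation and the grouping by parent CAZy class, the sketch below assumes that the ground truth and predicted CAZy family annotations of each protein are available as sets of family names. The data layout, the function names (`parent_class`, `f1_per_family`, `group_by_class`) and the example proteins are hypothetical and are not taken from the notebook.

```python
import re
from collections import defaultdict


def parent_class(family):
    """Return the CAZy class prefix of a family name, e.g. 'GH5' -> 'GH'."""
    return re.match(r"[A-Z]+", family).group(0)


def f1_per_family(truth, predicted):
    """Compute the F1-score of every family seen in the truth or predictions.

    truth and predicted map a protein ID to a set of CAZy family names
    (hypothetical layout). Only proteins listed in truth are scored.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for protein, true_fams in truth.items():
        pred_fams = predicted.get(protein, set())
        for fam in true_fams & pred_fams:
            tp[fam] += 1  # family correctly predicted
        for fam in pred_fams - true_fams:
            fp[fam] += 1  # family predicted but not annotated
        for fam in true_fams - pred_fams:
            fn[fam] += 1  # family annotated but missed
    scores = {}
    for fam in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[fam] + fp[fam] + fn[fam]
        scores[fam] = 2 * tp[fam] / denom if denom else 0.0
    return scores


def group_by_class(scores):
    """Group per-family scores under their parent CAZy class."""
    grouped = defaultdict(dict)
    for fam, f1 in scores.items():
        grouped[parent_class(fam)][fam] = f1
    return dict(grouped)


# Made-up example: four proteins, one of them a non-CAZyme (empty set).
truth = {"p1": {"GH5", "CBM6"}, "p2": {"GH5"}, "p3": {"GT2"}, "p4": set()}
predicted = {"p1": {"GH5"}, "p2": {"GH5"}, "p3": {"GT4"}, "p4": {"GH5"}}
grouped = group_by_class(f1_per_family(truth, predicted))
```

In this made-up example GH5 has two true positives and one false positive, so its F1-score is 2·2 / (2·2 + 1 + 0) = 0.8, while CBM6, GT2 and GT4 each score 0.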
Below is a table displaying the number of test sets in which each CAZy class was present; these counts were used to draw the proportional areas for each class in figure 5.1.
| Prediction_tool | GH | GT | PL | CE | AA | CBM |
|---|---|---|---|---|---|---|
| dbCAN | 124 | 70 | 22 | 16 | 14 | 50 |
| HMMER | 126 | 72 | 22 | 16 | 14 | 51 |
| DIAMOND | 124 | 70 | 22 | 16 | 14 | 50 |
| Hotpep | 125 | 70 | 22 | 16 | 14 | 65 |
| CUPP | 124 | 70 | 22 | 16 | 14 | 50 |
| eCAMI | 124 | 70 | 22 | 16 | 14 | 61 |
To evaluate the performance of predicting each CAZy family independently of all other CAZy families, the sensitivity and precision for each CAZy family, for each CAZyme classifier, were calculated and plotted against each other (Fig.??). Whereas sensitivity was plotted directly against specificity for CAZy classes, for CAZy families, owing to the extremely small variation in specificity scores, sensitivity was plotted as a percentage against log10 of the specificity percentage.
The following plots present the specificity (Fig.5.2), sensitivity (Fig.5.3), precision (Fig.5.4), F1-score (Fig.5.5) and accuracy (Fig.5.6) for each CAZy family per classifier. In accompaniment to each plot is a table summarising the mean statistic value for each classifier across all CAZy families for each CAZy class.
| CAZy_class | Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|---|
| CBM | dbCAN | 0.9999 | 0.0003 | 0.9998 | 1.0000 |
| CBM | HMMER | 0.9999 | 0.0004 | 0.9998 | 1.0000 |
| CBM | DIAMOND | 0.9999 | 0.0002 | 0.9998 | 1.0000 |
| CBM | Hotpep | 0.9974 | 0.0038 | 0.9965 | 0.9984 |
| CBM | CUPP | 1.0000 | 0.0000 | 1.0000 | 1.0000 |
| CBM | eCAMI | 0.9989 | 0.0017 | 0.9985 | 0.9994 |
| AA | dbCAN | 0.9997 | 0.0006 | 0.9994 | 1.0001 |
| AA | HMMER | 0.9997 | 0.0006 | 0.9993 | 1.0000 |
| AA | DIAMOND | 0.9997 | 0.0007 | 0.9993 | 1.0001 |
| AA | Hotpep | 0.9997 | 0.0006 | 0.9994 | 1.0001 |
| AA | CUPP | 0.9997 | 0.0006 | 0.9994 | 1.0001 |
| AA | eCAMI | 0.9998 | 0.0005 | 0.9995 | 1.0001 |
| CE | dbCAN | 0.9997 | 0.0007 | 0.9993 | 1.0000 |
| CE | HMMER | 0.9998 | 0.0003 | 0.9996 | 0.9999 |
| CE | DIAMOND | 0.9997 | 0.0007 | 0.9993 | 1.0001 |
| CE | Hotpep | 0.9995 | 0.0007 | 0.9992 | 0.9999 |
| CE | CUPP | 0.9998 | 0.0004 | 0.9996 | 1.0000 |
| CE | eCAMI | 0.9996 | 0.0007 | 0.9992 | 1.0000 |
| PL | dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 |
| PL | HMMER | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| PL | DIAMOND | 1.0000 | 0.0001 | 1.0000 | 1.0000 |
| PL | Hotpep | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| PL | CUPP | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| PL | eCAMI | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| GT | dbCAN | 1.0000 | 0.0001 | 1.0000 | 1.0000 |
| GT | HMMER | 0.9999 | 0.0002 | 0.9999 | 1.0000 |
| GT | DIAMOND | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| GT | Hotpep | 1.0000 | 0.0002 | 0.9999 | 1.0000 |
| GT | CUPP | 1.0000 | 0.0001 | 1.0000 | 1.0000 |
| GT | eCAMI | 1.0000 | 0.0002 | 0.9999 | 1.0000 |
| GH | dbCAN | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| GH | HMMER | 1.0000 | 0.0002 | 0.9999 | 1.0000 |
| GH | DIAMOND | 1.0000 | 0.0001 | 0.9999 | 1.0000 |
| GH | Hotpep | 0.9998 | 0.0006 | 0.9998 | 0.9999 |
| GH | CUPP | 1.0000 | 0.0001 | 1.0000 | 1.0000 |
| GH | eCAMI | 0.9999 | 0.0003 | 0.9998 | 1.0000 |
Figure 5.2: Scatter plot overlaying a one-dimensional box-and-whisker plot of specificity for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.
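The Mean, Standard Deviation, Lower CI and Upper CI columns of these summary tables can be computed along the following lines. This is a minimal sketch: the function name `summarise` and the example scores are hypothetical, and the interval construction (mean plus and minus a critical value times the standard error) is an assumption, since the text does not state how the intervals were derived. By default the sketch uses a normal quantile; for small samples a Student's t quantile can be passed instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarise(scores, confidence=0.95, critical_value=None):
    """Return (mean, sd, lower CI, upper CI) for a list of scores.

    critical_value defaults to the normal quantile (about 1.96 for 95%);
    pass a Student's t quantile instead for small samples.
    Requires at least two scores.
    """
    n = len(scores)
    centre = mean(scores)
    sd = stdev(scores)  # sample standard deviation (n - 1 denominator)
    if critical_value is None:
        critical_value = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = critical_value * sd / sqrt(n)
    return centre, sd, centre - half_width, centre + half_width


# Made-up sensitivity scores for four families of one class.
centre, sd, lower, upper = summarise([1.0, 1.0, 0.5, 1.0])
```

Because the interval is symmetric around the mean and is not truncated to the range of the statistic, the upper bound can exceed 1 when the mean is close to 1 and the spread is large, as it does for the made-up scores here. This may explain upper bounds such as 1.0001 in the tables.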
| CAZy_class | Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|---|
| CBM | dbCAN | 0.8945 | 0.2152 | 0.8333 | 0.9556 |
| CBM | HMMER | 0.7766 | 0.3606 | 0.6752 | 0.8781 |
| CBM | DIAMOND | 0.9052 | 0.2201 | 0.8427 | 0.9678 |
| CBM | Hotpep | 0.6006 | 0.4159 | 0.4975 | 0.7037 |
| CBM | CUPP | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| CBM | eCAMI | 0.6069 | 0.3853 | 0.5082 | 0.7056 |
| AA | dbCAN | 0.8132 | 0.2928 | 0.6442 | 0.9822 |
| AA | HMMER | 0.8899 | 0.2706 | 0.7336 | 1.0461 |
| AA | DIAMOND | 0.8040 | 0.2939 | 0.6343 | 0.9737 |
| AA | Hotpep | 0.8159 | 0.2935 | 0.6464 | 0.9854 |
| AA | CUPP | 0.7194 | 0.4076 | 0.4841 | 0.9547 |
| AA | eCAMI | 0.6972 | 0.3735 | 0.4816 | 0.9129 |
| CE | dbCAN | 0.8724 | 0.2887 | 0.7186 | 1.0262 |
| CE | HMMER | 0.9244 | 0.2487 | 0.7919 | 1.0569 |
| CE | DIAMOND | 0.8481 | 0.2655 | 0.7067 | 0.9896 |
| CE | Hotpep | 0.7921 | 0.3132 | 0.6252 | 0.9589 |
| CE | CUPP | 0.8504 | 0.3356 | 0.6716 | 1.0292 |
| CE | eCAMI | 0.7749 | 0.2659 | 0.6332 | 0.9165 |
| PL | dbCAN | 0.8076 | 0.3628 | 0.6468 | 0.9685 |
| PL | HMMER | 0.8571 | 0.3137 | 0.7180 | 0.9962 |
| PL | DIAMOND | 0.8287 | 0.3489 | 0.6740 | 0.9834 |
| PL | Hotpep | 0.6768 | 0.3560 | 0.5189 | 0.8346 |
| PL | CUPP | 0.6055 | 0.4310 | 0.4144 | 0.7966 |
| PL | eCAMI | 0.6159 | 0.4221 | 0.4288 | 0.8031 |
| GT | dbCAN | 0.8586 | 0.2695 | 0.7943 | 0.9228 |
| GT | HMMER | 0.8397 | 0.3045 | 0.7682 | 0.9113 |
| GT | DIAMOND | 0.8729 | 0.2621 | 0.8104 | 0.9354 |
| GT | Hotpep | 0.7537 | 0.3274 | 0.6757 | 0.8318 |
| GT | CUPP | 0.8073 | 0.3050 | 0.7345 | 0.8800 |
| GT | eCAMI | 0.7635 | 0.3165 | 0.6880 | 0.8389 |
| GH | dbCAN | 0.9252 | 0.1885 | 0.8917 | 0.9588 |
| GH | HMMER | 0.9188 | 0.2165 | 0.8806 | 0.9570 |
| GH | DIAMOND | 0.9258 | 0.1923 | 0.8916 | 0.9600 |
| GH | Hotpep | 0.8558 | 0.2556 | 0.8106 | 0.9011 |
| GH | CUPP | 0.8172 | 0.3475 | 0.7554 | 0.8789 |
| GH | eCAMI | 0.8036 | 0.3017 | 0.7499 | 0.8572 |
Figure 5.3: Scatter plot overlaying a one-dimensional box-and-whisker plot of sensitivity for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.
| CAZy_class | Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|---|
| CBM | dbCAN | 0.9012 | 0.2358 | 0.8341 | 0.9682 |
| CBM | HMMER | 0.8416 | 0.3438 | 0.7449 | 0.9383 |
| CBM | DIAMOND | 0.9042 | 0.2194 | 0.8418 | 0.9665 |
| CBM | Hotpep | 0.2739 | 0.2989 | 0.1999 | 0.3480 |
| CBM | CUPP | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| CBM | eCAMI | 0.4427 | 0.3508 | 0.3528 | 0.5325 |
| AA | dbCAN | 0.9149 | 0.1534 | 0.8263 | 1.0035 |
| AA | HMMER | 0.8225 | 0.2862 | 0.6572 | 0.9877 |
| AA | DIAMOND | 0.8832 | 0.1772 | 0.7809 | 0.9855 |
| AA | Hotpep | 0.9156 | 0.1526 | 0.8274 | 1.0037 |
| AA | CUPP | 0.7480 | 0.3840 | 0.5263 | 0.9698 |
| AA | eCAMI | 0.7846 | 0.3565 | 0.5787 | 0.9904 |
| CE | dbCAN | 0.8256 | 0.3191 | 0.6556 | 0.9957 |
| CE | HMMER | 0.8026 | 0.2910 | 0.6475 | 0.9576 |
| CE | DIAMOND | 0.8379 | 0.3034 | 0.6763 | 0.9996 |
| CE | Hotpep | 0.7979 | 0.3115 | 0.6320 | 0.9639 |
| CE | CUPP | 0.8336 | 0.3361 | 0.6545 | 1.0127 |
| CE | eCAMI | 0.8144 | 0.3100 | 0.6492 | 0.9796 |
| PL | dbCAN | 0.8636 | 0.3513 | 0.7079 | 1.0194 |
| PL | HMMER | 0.8538 | 0.3256 | 0.7094 | 0.9982 |
| PL | DIAMOND | 0.8506 | 0.3513 | 0.6949 | 1.0064 |
| PL | Hotpep | 0.8628 | 0.3509 | 0.7072 | 1.0184 |
| PL | CUPP | 0.7154 | 0.4494 | 0.5162 | 0.9147 |
| PL | eCAMI | 0.7240 | 0.4539 | 0.5227 | 0.9252 |
| GT | dbCAN | 0.9418 | 0.2336 | 0.8861 | 0.9975 |
| GT | HMMER | 0.8765 | 0.3003 | 0.8059 | 0.9470 |
| GT | DIAMOND | 0.9400 | 0.2335 | 0.8843 | 0.9957 |
| GT | Hotpep | 0.8917 | 0.3018 | 0.8197 | 0.9636 |
| GT | CUPP | 0.9054 | 0.2815 | 0.8383 | 0.9726 |
| GT | eCAMI | 0.8950 | 0.3014 | 0.8232 | 0.9669 |
| GH | dbCAN | 0.9639 | 0.1776 | 0.9324 | 0.9955 |
| GH | HMMER | 0.9331 | 0.2176 | 0.8947 | 0.9714 |
| GH | DIAMOND | 0.9585 | 0.1823 | 0.9261 | 0.9909 |
| GH | Hotpep | 0.9138 | 0.2502 | 0.8695 | 0.9581 |
| GH | CUPP | 0.8524 | 0.3452 | 0.7911 | 0.9138 |
| GH | eCAMI | 0.8837 | 0.2975 | 0.8308 | 0.9366 |
Figure 5.4: Scatter plot overlaying a one-dimensional box-and-whisker plot of precision for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.
| CAZy_class | Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|---|
| CBM | dbCAN | 0.8863 | 0.2176 | 0.8245 | 0.9482 |
| CBM | HMMER | 0.7755 | 0.3559 | 0.6754 | 0.8756 |
| CBM | DIAMOND | 0.8980 | 0.2136 | 0.8373 | 0.9587 |
| CBM | Hotpep | 0.3402 | 0.3115 | 0.2630 | 0.4174 |
| CBM | CUPP | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| CBM | eCAMI | 0.4819 | 0.3354 | 0.3960 | 0.5678 |
| AA | dbCAN | 0.8173 | 0.2539 | 0.6707 | 0.9639 |
| AA | HMMER | 0.8479 | 0.2701 | 0.6919 | 1.0038 |
| AA | DIAMOND | 0.8119 | 0.2539 | 0.6653 | 0.9584 |
| AA | Hotpep | 0.8187 | 0.2537 | 0.6723 | 0.9652 |
| AA | CUPP | 0.7123 | 0.3913 | 0.4864 | 0.9383 |
| AA | eCAMI | 0.7102 | 0.3681 | 0.4976 | 0.9227 |
| CE | dbCAN | 0.8408 | 0.2993 | 0.6813 | 1.0003 |
| CE | HMMER | 0.8443 | 0.2639 | 0.7036 | 0.9849 |
| CE | DIAMOND | 0.8357 | 0.2800 | 0.6865 | 0.9849 |
| CE | Hotpep | 0.7720 | 0.2980 | 0.6132 | 0.9307 |
| CE | CUPP | 0.8388 | 0.3310 | 0.6625 | 1.0152 |
| CE | eCAMI | 0.7791 | 0.2716 | 0.6344 | 0.9238 |
| PL | dbCAN | 0.8263 | 0.3540 | 0.6693 | 0.9832 |
| PL | HMMER | 0.8363 | 0.3085 | 0.6995 | 0.9731 |
| PL | DIAMOND | 0.8372 | 0.3471 | 0.6832 | 0.9911 |
| PL | Hotpep | 0.7390 | 0.3413 | 0.5877 | 0.8903 |
| PL | CUPP | 0.6396 | 0.4241 | 0.4515 | 0.8276 |
| PL | eCAMI | 0.6549 | 0.4277 | 0.4652 | 0.8445 |
| GT | dbCAN | 0.8869 | 0.2587 | 0.8252 | 0.9486 |
| GT | HMMER | 0.8502 | 0.2961 | 0.7806 | 0.9198 |
| GT | DIAMOND | 0.8958 | 0.2563 | 0.8347 | 0.9569 |
| GT | Hotpep | 0.7995 | 0.3111 | 0.7253 | 0.8736 |
| GT | CUPP | 0.8411 | 0.2886 | 0.7723 | 0.9099 |
| GT | eCAMI | 0.8110 | 0.3055 | 0.7382 | 0.8838 |
| GH | dbCAN | 0.9422 | 0.1807 | 0.9101 | 0.9743 |
| GH | HMMER | 0.9169 | 0.2164 | 0.8787 | 0.9550 |
| GH | DIAMOND | 0.9386 | 0.1840 | 0.9059 | 0.9713 |
| GH | Hotpep | 0.8782 | 0.2473 | 0.8344 | 0.9219 |
| GH | CUPP | 0.8280 | 0.3454 | 0.7666 | 0.8894 |
| GH | eCAMI | 0.8333 | 0.2931 | 0.7812 | 0.8854 |
Figure 5.5: Scatter plot overlaying a one-dimensional box-and-whisker plot of the F1-score for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.
| CAZy_class | Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|---|
| CBM | dbCAN | 0.9994 | 0.0016 | 0.9990 | 0.9999 |
| CBM | HMMER | 0.9988 | 0.0036 | 0.9978 | 0.9998 |
| CBM | DIAMOND | 0.9996 | 0.0009 | 0.9994 | 0.9999 |
| CBM | Hotpep | 0.9970 | 0.0042 | 0.9960 | 0.9981 |
| CBM | CUPP | 0.9976 | 0.0043 | 0.9964 | 0.9988 |
| CBM | eCAMI | 0.9985 | 0.0022 | 0.9979 | 0.9991 |
| AA | dbCAN | 0.9995 | 0.0008 | 0.9990 | 0.9999 |
| AA | HMMER | 0.9994 | 0.0009 | 0.9989 | 1.0000 |
| AA | DIAMOND | 0.9994 | 0.0008 | 0.9989 | 0.9999 |
| AA | Hotpep | 0.9995 | 0.0007 | 0.9991 | 0.9999 |
| AA | CUPP | 0.9994 | 0.0007 | 0.9990 | 0.9998 |
| AA | eCAMI | 0.9993 | 0.0008 | 0.9988 | 0.9998 |
| CE | dbCAN | 0.9996 | 0.0009 | 0.9990 | 1.0001 |
| CE | HMMER | 0.9995 | 0.0007 | 0.9991 | 0.9999 |
| CE | DIAMOND | 0.9994 | 0.0010 | 0.9989 | 0.9999 |
| CE | Hotpep | 0.9993 | 0.0010 | 0.9987 | 0.9998 |
| CE | CUPP | 0.9996 | 0.0006 | 0.9993 | 0.9999 |
| CE | eCAMI | 0.9992 | 0.0011 | 0.9986 | 0.9998 |
| PL | dbCAN | 0.9999 | 0.0002 | 0.9998 | 1.0000 |
| PL | HMMER | 0.9999 | 0.0003 | 0.9997 | 1.0000 |
| PL | DIAMOND | 0.9999 | 0.0002 | 0.9998 | 1.0000 |
| PL | Hotpep | 0.9998 | 0.0003 | 0.9997 | 0.9999 |
| PL | CUPP | 0.9998 | 0.0003 | 0.9997 | 0.9999 |
| PL | eCAMI | 0.9997 | 0.0004 | 0.9996 | 0.9999 |
| GT | dbCAN | 0.9993 | 0.0019 | 0.9989 | 0.9998 |
| GT | HMMER | 0.9992 | 0.0023 | 0.9987 | 0.9998 |
| GT | DIAMOND | 0.9995 | 0.0010 | 0.9993 | 0.9998 |
| GT | Hotpep | 0.9985 | 0.0052 | 0.9973 | 0.9998 |
| GT | CUPP | 0.9992 | 0.0018 | 0.9988 | 0.9997 |
| GT | eCAMI | 0.9991 | 0.0020 | 0.9986 | 0.9996 |
| GH | dbCAN | 0.9997 | 0.0011 | 0.9995 | 0.9999 |
| GH | HMMER | 0.9996 | 0.0015 | 0.9993 | 0.9999 |
| GH | DIAMOND | 0.9998 | 0.0006 | 0.9996 | 0.9999 |
| GH | Hotpep | 0.9994 | 0.0015 | 0.9991 | 0.9997 |
| GH | CUPP | 0.9996 | 0.0013 | 0.9994 | 0.9999 |
| GH | eCAMI | 0.9994 | 0.0011 | 0.9993 | 0.9996 |
Figure 5.6: Scatter plot overlaying a one-dimensional box-and-whisker plot of the accuracy for each CAZy family for each CAZyme classifier. Each CAZy family is represented as a single point on the plot.
For better resolution, we can group the CAZy families by their parent CAZy classes and compare the performances of the tools CAZy class by CAZy class. Owing to the minimal variation in specificity scores, specificity was plotted as log10 of the percentage specificity.
Figure 5.7 shows the plotting of sensitivity against specificity for each Glycoside Hydrolase CAZy family.
Figure 5.7: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycoside Hydrolases. Each GH CAZy family is represented as a single point on the plot.
Figure 5.8 shows the plotting of sensitivity against specificity for each Glycosyltransferases CAZy family.
Figure 5.8: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycosyltransferases. Each GT CAZy family is represented as a single point on the plot.
Figure 5.9 shows the plotting of sensitivity against specificity for each Polysaccharide Lyases CAZy family.
Figure 5.9: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Polysaccharide Lyases. Each PL CAZy family is represented as a single point on the plot.
Figure 5.10 shows the plotting of sensitivity against specificity for each Carbohydrate Esterases CAZy family.
Figure 5.10: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Esterases. Each CE CAZy family is represented as a single point on the plot.
Figure ?? shows the plotting of sensitivity against specificity for each Auxiliary Activities CAZy family.
Figure 5.11: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Auxiliary Activities. Each AA CAZy family is represented as a single point on the plot.
Figure 5.12 shows the plotting of sensitivity against specificity for each Carbohydrate Binding Module CAZy family.
Figure 5.12: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Binding Modules. Each CBM CAZy family is represented as a single point on the plot.
We then pulled out the CAZy families for which at least three classifiers produced a sensitivity score of less than 0.75.
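The selection rule can be sketched as a simple filter. The sketch assumes the per-family sensitivity scores are held in a nested dictionary keyed by family and then by classifier; the function name `low_sensitivity_families`, the data layout and the example scores are all hypothetical and do not reproduce results from this evaluation.

```python
def low_sensitivity_families(sensitivity, threshold=0.75, min_classifiers=3):
    """Return families where at least min_classifiers tools score below threshold.

    sensitivity maps a CAZy family name to {classifier name: sensitivity}.
    The result maps each flagged family to the sorted list of tools that
    fell below the threshold.
    """
    flagged = {}
    for family, per_tool in sensitivity.items():
        poor = sorted(tool for tool, score in per_tool.items() if score < threshold)
        if len(poor) >= min_classifiers:
            flagged[family] = poor
    return flagged


# Made-up scores for illustration only.
example = {
    "GH5": {"dbCAN": 0.95, "HMMER": 0.90, "DIAMOND": 0.96,
            "Hotpep": 0.70, "CUPP": 0.80, "eCAMI": 0.78},
    "CBM50": {"dbCAN": 0.80, "HMMER": 0.60, "DIAMOND": 0.85,
              "Hotpep": 0.40, "CUPP": 0.00, "eCAMI": 0.50},
}
flagged = low_sensitivity_families(example)
```

With these made-up scores only CBM50 is flagged, because four tools fall below 0.75, whereas GH5 has a single tool below the threshold.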
CAZy annotates proteins in a domain-wise manner. Consequently, a single protein may be assigned to multiple CAZy families. The ability of a classifier to assign all the correct CAZy family annotations for a given protein is not captured when only evaluating the CAZy family classification performance per CAZy family, independently of all other CAZy families.
The CAZy family multi-label classification performance is represented by the Rand Index (RI) and Adjusted Rand Index (ARI). The RI is a quantitative measure of similarity between two clusterings, obtained by considering all pairs of samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and true clusterings. In this case the two clusterings are the predicted and ground truth CAZy family annotations. The raw RI score is then “adjusted for chance” into the ARI score using the following scheme:
ARI = (RI - Expected_RI) / (max(RI) - Expected_RI)
This produces a score between 1 and -1. A score of 1 is produced if all predicted and known CAZy family annotations are identical, 0 if the clustering is completely random, and -1 if the clustering is systematically incorrect, that is, the number of incorrect classifications of proteins is greater than would be expected from randomly annotating proteins with CAZy families.
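The pair counting behind the RI and the chance adjustment behind the ARI can be sketched as follows. The sketch assumes each protein is encoded as two binary vectors over a fixed list of CAZy families (1 where the family is annotated or predicted, 0 otherwise); this encoding, the function names and the example vectors are assumptions made for illustration, and the notebook's own implementation may differ. The ARI is computed from the standard pair-count contingency table, which is one common way of evaluating the formula above.

```python
from collections import Counter
from itertools import combinations
from math import comb


def rand_index(truth, predicted):
    """Fraction of item pairs on which the two labelings agree (needs n >= 2)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum(
        (truth[i] == truth[j]) == (predicted[i] == predicted[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def adjusted_rand_index(truth, predicted):
    """Rand Index corrected for chance via pair counts (needs n >= 2)."""
    n = len(truth)
    # Pairs placed together in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    # Pairs placed together in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:  # degenerate case: the two labelings are identical
        return 1.0
    return (index - expected) / (maximum - expected)


# Hypothetical protein scored over six families: two true annotations,
# of which only the first is predicted.
truth = [1, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0]
ri = rand_index(truth, predicted)
ari = adjusted_rand_index(truth, predicted)
```

For this hypothetical protein, 10 of the 15 pairs agree, giving an RI of about 0.67, while the ARI is 8/23 (about 0.35). The gap illustrates how the large number of shared zeros inflates the raw RI.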
| Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|
| dbCAN | 0.9997 | 0.0011 | 0.9997 | 0.9997 |
| HMMER | 0.9996 | 0.0014 | 0.9996 | 0.9996 |
| DIAMOND | 0.9998 | 0.0010 | 0.9998 | 0.9998 |
| Hotpep | 0.9991 | 0.0023 | 0.9991 | 0.9991 |
| CUPP | 0.9995 | 0.0015 | 0.9994 | 0.9995 |
| eCAMI | 0.9994 | 0.0017 | 0.9994 | 0.9995 |
| Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|
| dbCAN | 0.9391 | 0.2359 | 0.9352 | 0.9430 |
| HMMER | 0.9250 | 0.2554 | 0.9208 | 0.9292 |
| DIAMOND | 0.9530 | 0.2105 | 0.9495 | 0.9565 |
| Hotpep | 0.8758 | 0.3083 | 0.8707 | 0.8809 |
| CUPP | 0.9098 | 0.2712 | 0.9053 | 0.9143 |
| eCAMI | 0.9077 | 0.2778 | 0.9031 | 0.9123 |
Multilabel classification arises when a single instance can be assigned to multiple classes. In this evaluation a single instance is a protein and the classes are CAZy families; a single CAZyme can be assigned to multiple CAZy families. This is important to take into consideration because the approaches used for the statistical evaluation of binary classification provide only a limited view of the performance of the classifiers when applied to multilabel classification.
The plots are violin plots overlaid with scatter plots of the Rand Index and Adjusted Rand Index for every protein in every test set, excluding true negatives.
Figure 5.13: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.
Figure 5.14: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.
Figure 5.15: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy families.
Figure 5.16: 95% confidence interval around the mean of Adjusted Rand Index (ARI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy families
The performance of a classifier may vary per taxonomy group. For this evaluation the test sets were separated into the taxonomy groups:
- Bacteria
- Eukaryote

The performance per classifier per taxonomy group was compared against the performance when all test sets were pooled together.
Here we calculate the mean plus and minus the standard deviation of the F1-score of each prediction tool for each taxonomy group, to represent the overall performance per taxonomy group.
| Prediction_tool | Bact Mean | Bact Standard Deviation | Bact Lower CI | Bact Upper CI | Euk Mean | Euk Standard Deviation | Euk Lower CI | Euk Upper CI | All Mean | All Standard Deviation | All Lower CI | All Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9217 | 0.0522 | 0.9048 | 0.9386 | 0.9103 | 0.0545 | 0.8903 | 0.9303 | 0.9167 | 0.0531 | 0.9040 | 0.9293 |
| dbCAN | 0.9434 | 0.0782 | 0.9180 | 0.9687 | 0.9385 | 0.0826 | 0.9082 | 0.9688 | 0.9412 | 0.0796 | 0.9222 | 0.9602 |
| DIAMOND | 0.9481 | 0.0919 | 0.9183 | 0.9779 | 0.9480 | 0.0908 | 0.9147 | 0.9813 | 0.9481 | 0.0907 | 0.9264 | 0.9697 |
| eCAMI | 0.9270 | 0.0763 | 0.9023 | 0.9518 | 0.8913 | 0.0960 | 0.8560 | 0.9265 | 0.9112 | 0.0868 | 0.8905 | 0.9319 |
| HMMER | 0.9210 | 0.0791 | 0.8953 | 0.9466 | 0.9425 | 0.0215 | 0.9346 | 0.9503 | 0.9305 | 0.0613 | 0.9159 | 0.9451 |
| Hotpep | 0.8898 | 0.0774 | 0.8647 | 0.9149 | 0.8817 | 0.1083 | 0.8420 | 0.9214 | 0.8862 | 0.0917 | 0.8643 | 0.9081 |
Figure 6.1: 95% confidence interval around the mean F1-score of the binary classification of CAZymes and non-CAZymes per taxonomic group.
Figure 6.2: One dimensional scatter plot overlaying a box and whisker plot of the specificity of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.
Figure 6.3: One dimensional scatter plot overlaying a box and whisker plot of the sensitivity of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.
Figure 6.4: One dimensional scatter plot overlaying a box and whisker plot of the precision of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.
Figure 6.5: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.
Figure 6.6: One dimensional scatter plot overlaying a box and whisker plot of the accuracy of binary classification per CAZyme/non-CAZyme classifier per taxonomy group. Each point represents the score from one test set.
Below is a table containing the mean F1-score plus/minus the standard deviation per CAZyme classifier per taxonomy group, representing the overall performance per CAZyme classifier per taxonomy group for all CAZy class classification.
| Prediction_tool | Bact Mean | Bact Standard Deviation | Bact Lower CI | Bact Upper CI | Euk Mean | Euk Standard Deviation | Euk Lower CI | Euk Upper CI | All Mean | All Standard Deviation | All Lower CI | All Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9217 | 0.0522 | 0.9048 | 0.9386 | 0.7370 | 0.3863 | 0.6765 | 0.7975 | 0.9170 | 0.1460 | 0.9017 | 0.9323 |
| dbCAN | 0.9434 | 0.0782 | 0.9180 | 0.9687 | 0.9139 | 0.1394 | 0.8921 | 0.9358 | 0.8675 | 0.2013 | 0.8464 | 0.8886 |
| DIAMOND | 0.9481 | 0.0919 | 0.9183 | 0.9779 | 0.9110 | 0.1924 | 0.8808 | 0.9411 | 0.9213 | 0.1725 | 0.9032 | 0.9394 |
| eCAMI | 0.9270 | 0.0763 | 0.9023 | 0.9518 | 0.8340 | 0.2013 | 0.8025 | 0.8656 | 0.8207 | 0.2116 | 0.7985 | 0.8429 |
| HMMER | 0.9210 | 0.0791 | 0.8953 | 0.9466 | 0.8638 | 0.2186 | 0.8296 | 0.8981 | 0.7343 | 0.3937 | 0.6930 | 0.7756 |
| Hotpep | 0.8898 | 0.0774 | 0.8647 | 0.9149 | 0.8148 | 0.2185 | 0.7806 | 0.8490 | 0.8487 | 0.1950 | 0.8282 | 0.8691 |
Figure 6.7: 95% confidence interval around the mean F1-score of the classification of CAZy classes per taxonomic group.
To evaluate the difference between the taxonomic kingdoms per CAZy class, the data was separated into each of the CAZy classes. The F1-score was then plotted as a one-dimensional scatter plot overlaying a boxplot, with data grouped by the taxonomic kingdom and facet wrapped by classifier.
Figure 6.8 summarises the difference in performance between bacterial and eukaryotic GH class members.
Overall, the classifiers demonstrated similar performances between the bacterial and eukaryotic test sets. eCAMI showed the greatest difference in performance between bacteria and eukaryotes, demonstrating a more consistent performance against bacterial proteins, as inferred from the smaller interquartile range.
Figure 6.8: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying GH class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
The following tables summarise the performance for each classifier across all test sets for each taxonomic group (bacteria (table ??) and eukaryota (table ??)), and when all test sets are pooled (which is assigned the taxonomic group ‘All’) (table ??).
| Classifier | Mean Bacteria Specificity | Bacteria Specificity Standard Deviation | Bacteria Specificity Lower CI | Bacteria Specificity Upper CI | Mean Bacteria Sensitivity | Bacteria Sensitivity Standard Deviation | Bacteria Sensitivity Lower CI | Bacteria Sensitivity Upper CI | Mean Bacteria Precision | Bacteria Precision Standard Deviation | Bacteria Precision Lower CI | Bacteria Precision Upper CI | Mean Bacteria F1-score | Bacteria F1-score Standard Deviation | Bacteria F1-score Lower CI | Bacteria F1-score Upper CI | Mean Bacteria Accuracy | Bacteria Accuracy Standard Deviation | Bacteria Accuracy Lower CI | Bacteria Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9927 | 0.0151 | 0.9878 | 0.9976 | 0.9202 | 0.0786 | 0.8948 | 0.9457 | 0.9938 | 0.0132 | 0.9895 | 0.9981 | 0.9536 | 0.0469 | 0.9384 | 0.9688 | 0.9563 | 0.0323 | 0.9458 | 0.9668 |
| dbCAN | 0.9942 | 0.0138 | 0.9897 | 0.9986 | 0.9281 | 0.1014 | 0.8952 | 0.9609 | 0.9924 | 0.0233 | 0.9848 | 0.9999 | 0.9555 | 0.0691 | 0.9331 | 0.9779 | 0.9613 | 0.0401 | 0.9483 | 0.9743 |
| DIAMOND | 0.9894 | 0.0174 | 0.9838 | 0.9951 | 0.9439 | 0.1112 | 0.9079 | 0.9800 | 0.9878 | 0.0287 | 0.9785 | 0.9971 | 0.9608 | 0.0737 | 0.9369 | 0.9847 | 0.9670 | 0.0468 | 0.9518 | 0.9821 |
| eCAMI | 0.9813 | 0.0244 | 0.9734 | 0.9892 | 0.9205 | 0.1000 | 0.8881 | 0.9529 | 0.9823 | 0.0256 | 0.9740 | 0.9906 | 0.9469 | 0.0585 | 0.9280 | 0.9659 | 0.9504 | 0.0484 | 0.9347 | 0.9661 |
| HMMER | 0.9933 | 0.0136 | 0.9889 | 0.9977 | 0.9106 | 0.1074 | 0.8758 | 0.9454 | 0.9926 | 0.0169 | 0.9871 | 0.9980 | 0.9457 | 0.0762 | 0.9210 | 0.9704 | 0.9527 | 0.0430 | 0.9388 | 0.9666 |
| Hotpep | 0.9763 | 0.0274 | 0.9674 | 0.9851 | 0.9070 | 0.0890 | 0.8782 | 0.9359 | 0.9755 | 0.0359 | 0.9638 | 0.9871 | 0.9375 | 0.0561 | 0.9193 | 0.9557 | 0.9425 | 0.0402 | 0.9295 | 0.9556 |
| Classifier | Mean Eukaryote Specificity | Eukaryote Specificity Standard Deviation | Eukaryote Specificity Lower CI | Eukaryote Specificity Upper CI | Mean Eukaryote Sensitivity | Eukaryote Sensitivity Standard Deviation | Eukaryote Sensitivity Lower CI | Eukaryote Sensitivity Upper CI | Mean Eukaryote Precision | Eukaryote Precision Standard Deviation | Eukaryote Precision Lower CI | Eukaryote Precision Upper CI | Mean Eukaryote F1-score | Eukaryote F1-score Standard Deviation | Eukaryote F1-score Lower CI | Eukaryote F1-score Upper CI | Mean Eukaryote Accuracy | Eukaryote Accuracy Standard Deviation | Eukaryote Accuracy Lower CI | Eukaryote Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9990 | 0.0038 | 0.9976 | 1.0004 | 0.9053 | 0.0532 | 0.8858 | 0.9248 | 0.9981 | 0.0075 | 0.9953 | 1.0008 | 0.9486 | 0.0302 | 0.9375 | 0.9597 | 0.9603 | 0.0213 | 0.9525 | 0.9681 |
| dbCAN | 0.9983 | 0.0053 | 0.9964 | 1.0003 | 0.9118 | 0.1032 | 0.8739 | 0.9496 | 0.9976 | 0.0077 | 0.9947 | 1.0004 | 0.9493 | 0.0630 | 0.9262 | 0.9725 | 0.9640 | 0.0362 | 0.9507 | 0.9773 |
| DIAMOND | 0.9983 | 0.0053 | 0.9964 | 1.0003 | 0.9458 | 0.0995 | 0.9093 | 0.9823 | 0.9976 | 0.0076 | 0.9948 | 1.0004 | 0.9678 | 0.0634 | 0.9445 | 0.9911 | 0.9773 | 0.0330 | 0.9652 | 0.9893 |
| eCAMI | 0.9978 | 0.0073 | 0.9952 | 1.0005 | 0.8366 | 0.1063 | 0.7976 | 0.8756 | 0.9968 | 0.0103 | 0.9930 | 1.0005 | 0.9056 | 0.0686 | 0.8805 | 0.9308 | 0.9321 | 0.0376 | 0.9183 | 0.9459 |
| HMMER | 0.9986 | 0.0044 | 0.9970 | 1.0002 | 0.9207 | 0.0362 | 0.9074 | 0.9340 | 0.9967 | 0.0106 | 0.9928 | 1.0005 | 0.9568 | 0.0196 | 0.9496 | 0.9640 | 0.9658 | 0.0161 | 0.9599 | 0.9717 |
| Hotpep | 0.9967 | 0.0083 | 0.9936 | 0.9997 | 0.8517 | 0.1278 | 0.8049 | 0.8986 | 0.9952 | 0.0116 | 0.9909 | 0.9994 | 0.9122 | 0.0802 | 0.8828 | 0.9416 | 0.9376 | 0.0455 | 0.9209 | 0.9542 |
| Classifier | Mean All Specificity | All Specificity Standard Deviation | All Specificity Lower CI | All Specificity Upper CI | Mean All Sensitivity | All Sensitivity Standard Deviation | All Sensitivity Lower CI | All Sensitivity Upper CI | Mean All Precision | All Precision Standard Deviation | All Precision Lower CI | All Precision Upper CI | Mean All F1-score | All F1-score Standard Deviation | All F1-score Lower CI | All F1-score Upper CI | Mean All Accuracy | All Accuracy Standard Deviation | All Accuracy Lower CI | All Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9955 | 0.0119 | 0.9927 | 0.9983 | 0.9136 | 0.0685 | 0.8973 | 0.9300 | 0.9957 | 0.0112 | 0.9930 | 0.9983 | 0.9514 | 0.0402 | 0.9418 | 0.9609 | 0.9581 | 0.0279 | 0.9514 | 0.9647 |
| dbCAN | 0.9960 | 0.0110 | 0.9934 | 0.9986 | 0.9209 | 0.1017 | 0.8966 | 0.9451 | 0.9947 | 0.0182 | 0.9903 | 0.9990 | 0.9527 | 0.0661 | 0.9370 | 0.9685 | 0.9625 | 0.0382 | 0.9534 | 0.9716 |
| DIAMOND | 0.9934 | 0.0141 | 0.9900 | 0.9967 | 0.9447 | 0.1054 | 0.9196 | 0.9699 | 0.9921 | 0.0224 | 0.9868 | 0.9975 | 0.9639 | 0.0689 | 0.9475 | 0.9803 | 0.9715 | 0.0413 | 0.9617 | 0.9814 |
| eCAMI | 0.9886 | 0.0205 | 0.9837 | 0.9935 | 0.8834 | 0.1104 | 0.8570 | 0.9097 | 0.9887 | 0.0214 | 0.9836 | 0.9938 | 0.9286 | 0.0660 | 0.9129 | 0.9444 | 0.9423 | 0.0446 | 0.9317 | 0.9529 |
| HMMER | 0.9957 | 0.0108 | 0.9931 | 0.9982 | 0.9151 | 0.0834 | 0.8952 | 0.9350 | 0.9944 | 0.0145 | 0.9909 | 0.9978 | 0.9506 | 0.0583 | 0.9367 | 0.9645 | 0.9585 | 0.0342 | 0.9503 | 0.9667 |
| Hotpep | 0.9853 | 0.0234 | 0.9797 | 0.9909 | 0.8825 | 0.1106 | 0.8562 | 0.9089 | 0.9842 | 0.0294 | 0.9772 | 0.9912 | 0.9263 | 0.0685 | 0.9100 | 0.9426 | 0.9403 | 0.0424 | 0.9302 | 0.9504 |
The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.9), sensitivity (6.10), precision (6.11), F1-score (6.12), and accuracy (6.13).
Figure 6.9: One dimensional scatter plot of the specificity per test set for the classification of GH class members, overlaying a box plot
Figure 6.10: One dimensional scatter plot of the sensitivity per test set for the classification of GH class members, overlaying a box plot
Figure 6.11: One dimensional scatter plot of the precision per test set for the classification of GH class members, overlaying a box plot
Figure 6.12: One dimensional scatter plot of the F1-score per test set for the classification of GH class members, overlaying a box plot
Figure 6.13: One dimensional scatter plot of the accuracy per test set for the classification of GH class members, overlaying a box plot
Figure 6.14 plots the difference in performance between bacterial and eukaryotic GT class members. Hotpep demonstrates the greatest difference in performance between bacteria and eukaryotes, with a more consistent performance for eukaryotes, as inferred from the smaller interquartile range. Otherwise, there was no significant difference in performance between the two kingdoms.
Figure 6.14: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying GT class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
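Consistency is judged here from the interquartile range (IQR) of the per-test-set F1-scores, which is the height of the box in each box plot. The following is a minimal sketch of that comparison using only the Python standard library; the scores are invented for illustration and are not taken from the benchmark, and `interquartile_range` is a hypothetical helper.

```python
import statistics


def interquartile_range(scores):
    """Return Q3 - Q1 for a list of per-test-set scores."""
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    return q3 - q1


# Hypothetical per-test-set F1-scores for one classifier (not real results).
f1_by_kingdom = {
    "bacteria": [0.55, 0.70, 0.80, 0.90, 0.95, 1.00],
    "eukaryota": [0.82, 0.85, 0.86, 0.88, 0.90, 0.91],
}

iqr_by_kingdom = {
    kingdom: interquartile_range(scores)
    for kingdom, scores in f1_by_kingdom.items()
}
for kingdom, iqr in iqr_by_kingdom.items():
    print(f"{kingdom}: IQR = {iqr:.3f}")
```

A smaller IQR means the middle half of the test sets scored within a narrower band, which is what "more consistent" refers to in the text. Quartile estimates vary slightly between methods, so values from a plotting library may differ a little from those of `statistics.quantiles`.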
The following tables summarise the performance of each classifier across all test sets, first for each taxonomic group (bacteria, then eukaryota) and then with all test sets pooled (assigned the taxonomic group ‘All’).
| Classifier | Mean Bacteria Specificity | Bacteria Specificity Standard Deviation | Bacteria Specificity Lower CI | Bacteria Specificity Upper CI | Mean Bacteria Sensitivity | Bacteria Sensitivity Standard Deviation | Bacteria Sensitivity Lower CI | Bacteria Sensitivity Upper CI | Mean Bacteria Precision | Bacteria Precision Standard Deviation | Bacteria Precision Lower CI | Bacteria Precision Upper CI | Mean Bacteria F1-score | Bacteria F1-score Standard Deviation | Bacteria F1-score Lower CI | Bacteria F1-score Upper CI | Mean Bacteria Accuracy | Bacteria Accuracy Standard Deviation | Bacteria Accuracy Lower CI | Bacteria Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9989 | 0.0051 | 0.9972 | 1.0005 | 0.8767 | 0.1167 | 0.8389 | 0.9146 | 0.9985 | 0.0069 | 0.9963 | 1.0008 | 0.9292 | 0.0753 | 0.9048 | 0.9536 | 0.9584 | 0.0518 | 0.9416 | 0.9752 |
| dbCAN | 0.9996 | 0.0027 | 0.9987 | 1.0004 | 0.8777 | 0.1416 | 0.8318 | 0.9236 | 0.9994 | 0.0038 | 0.9982 | 1.0006 | 0.9273 | 0.0997 | 0.8950 | 0.9597 | 0.9573 | 0.0666 | 0.9357 | 0.9789 |
| DIAMOND | 0.9986 | 0.0063 | 0.9965 | 1.0006 | 0.9324 | 0.1543 | 0.8824 | 0.9825 | 0.9985 | 0.0072 | 0.9962 | 1.0008 | 0.9557 | 0.1105 | 0.9199 | 0.9916 | 0.9738 | 0.0718 | 0.9505 | 0.9971 |
| eCAMI | 0.9977 | 0.0090 | 0.9948 | 1.0006 | 0.8589 | 0.1683 | 0.8044 | 0.9135 | 0.9976 | 0.0084 | 0.9949 | 1.0004 | 0.9134 | 0.1121 | 0.8771 | 0.9498 | 0.9526 | 0.0647 | 0.9316 | 0.9736 |
| HMMER | 0.9996 | 0.0024 | 0.9988 | 1.0004 | 0.8485 | 0.1274 | 0.8072 | 0.8898 | 0.9993 | 0.0044 | 0.9978 | 1.0007 | 0.9114 | 0.0950 | 0.8806 | 0.9422 | 0.9467 | 0.0675 | 0.9249 | 0.9686 |
| Hotpep | 0.9985 | 0.0046 | 0.9970 | 1.0000 | 0.6810 | 0.1931 | 0.6184 | 0.7435 | 0.9956 | 0.0139 | 0.9911 | 1.0001 | 0.7922 | 0.1493 | 0.7438 | 0.8406 | 0.8940 | 0.0798 | 0.8682 | 0.9199 |
| Classifier | Mean Eukaryote Specificity | Eukaryote Specificity Standard Deviation | Eukaryote Specificity Lower CI | Eukaryote Specificity Upper CI | Mean Eukaryote Sensitivity | Eukaryote Sensitivity Standard Deviation | Eukaryote Sensitivity Lower CI | Eukaryote Sensitivity Upper CI | Mean Eukaryote Precision | Eukaryote Precision Standard Deviation | Eukaryote Precision Lower CI | Eukaryote Precision Upper CI | Mean Eukaryote F1-score | Eukaryote F1-score Standard Deviation | Eukaryote F1-score Lower CI | Eukaryote F1-score Upper CI | Mean Eukaryote Accuracy | Eukaryote Accuracy Standard Deviation | Eukaryote Accuracy Lower CI | Eukaryote Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9971 | 0.0103 | 0.9934 | 1.0009 | 0.8518 | 0.1217 | 0.8071 | 0.8964 | 0.9954 | 0.0155 | 0.9897 | 1.0011 | 0.9129 | 0.0770 | 0.8846 | 0.9411 | 0.9378 | 0.0642 | 0.9143 | 0.9614 |
| dbCAN | 0.9983 | 0.0093 | 0.9949 | 1.0017 | 0.8889 | 0.1384 | 0.8382 | 0.9397 | 0.9980 | 0.0112 | 0.9939 | 1.0021 | 0.9333 | 0.0980 | 0.8973 | 0.9692 | 0.9520 | 0.0807 | 0.9223 | 0.9816 |
| DIAMOND | 0.9965 | 0.0108 | 0.9926 | 1.0005 | 0.9302 | 0.1429 | 0.8777 | 0.9826 | 0.9948 | 0.0160 | 0.9889 | 1.0006 | 0.9540 | 0.0999 | 0.9173 | 0.9906 | 0.9656 | 0.0835 | 0.9350 | 0.9962 |
| eCAMI | 0.9983 | 0.0093 | 0.9949 | 1.0017 | 0.8454 | 0.1579 | 0.7875 | 0.9033 | 0.9979 | 0.0115 | 0.9937 | 1.0021 | 0.9060 | 0.1112 | 0.8652 | 0.9468 | 0.9280 | 0.0953 | 0.8931 | 0.9630 |
| HMMER | 0.9957 | 0.0139 | 0.9906 | 1.0008 | 0.9076 | 0.0654 | 0.8837 | 0.9316 | 0.9963 | 0.0128 | 0.9916 | 1.0010 | 0.9486 | 0.0367 | 0.9351 | 0.9620 | 0.9640 | 0.0240 | 0.9552 | 0.9727 |
| Hotpep | 0.9983 | 0.0093 | 0.9949 | 1.0017 | 0.7811 | 0.1705 | 0.7185 | 0.8436 | 0.9977 | 0.0125 | 0.9932 | 1.0023 | 0.8644 | 0.1264 | 0.8181 | 0.9108 | 0.9066 | 0.1021 | 0.8691 | 0.9441 |
| Classifier | Mean All Specificity | All Specificity Standard Deviation | All Specificity Lower CI | All Specificity Upper CI | Mean All Sensitivity | All Sensitivity Standard Deviation | All Sensitivity Lower CI | All Sensitivity Upper CI | Mean All Precision | All Precision Standard Deviation | All Precision Lower CI | All Precision Upper CI | Mean All F1-score | All F1-score Standard Deviation | All F1-score Lower CI | All F1-score Upper CI | Mean All Accuracy | All Accuracy Standard Deviation | All Accuracy Lower CI | All Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9981 | 0.0078 | 0.9962 | 1.0000 | 0.8657 | 0.1188 | 0.8374 | 0.8940 | 0.9971 | 0.0115 | 0.9944 | 0.9999 | 0.9220 | 0.0759 | 0.9039 | 0.9401 | 0.9493 | 0.0581 | 0.9354 | 0.9632 |
| dbCAN | 0.9990 | 0.0065 | 0.9975 | 1.0006 | 0.8827 | 0.1393 | 0.8495 | 0.9159 | 0.9988 | 0.0080 | 0.9969 | 1.0007 | 0.9300 | 0.0983 | 0.9065 | 0.9534 | 0.9549 | 0.0727 | 0.9376 | 0.9722 |
| DIAMOND | 0.9977 | 0.0086 | 0.9956 | 0.9997 | 0.9314 | 0.1483 | 0.8961 | 0.9668 | 0.9968 | 0.0120 | 0.9940 | 0.9997 | 0.9550 | 0.1052 | 0.9299 | 0.9800 | 0.9702 | 0.0768 | 0.9519 | 0.9885 |
| eCAMI | 0.9980 | 0.0090 | 0.9958 | 1.0002 | 0.8529 | 0.1627 | 0.8141 | 0.8917 | 0.9978 | 0.0098 | 0.9954 | 1.0001 | 0.9101 | 0.1109 | 0.8837 | 0.9366 | 0.9417 | 0.0800 | 0.9226 | 0.9608 |
| HMMER | 0.9979 | 0.0096 | 0.9956 | 1.0002 | 0.8747 | 0.1080 | 0.8489 | 0.9005 | 0.9980 | 0.0092 | 0.9958 | 1.0002 | 0.9279 | 0.0768 | 0.9095 | 0.9462 | 0.9544 | 0.0532 | 0.9417 | 0.9671 |
| Hotpep | 0.9984 | 0.0070 | 0.9967 | 1.0001 | 0.7253 | 0.1889 | 0.6802 | 0.7703 | 0.9966 | 0.0132 | 0.9934 | 0.9997 | 0.8242 | 0.1433 | 0.7900 | 0.8584 | 0.8996 | 0.0899 | 0.8782 | 0.9210 |
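Several upper confidence bounds in the tables above exceed 1 (for example 1.0005), which a proportion cannot. This is the pattern produced by a symmetric interval of the form mean ± z·SD/√n when the mean lies close to 1. The source does not state how its intervals were computed, so the following is only a sketch under that assumption; the 95% level, the `mean_ci` and `clip_unit` helpers and the scores are all illustrative.

```python
import math
import statistics

Z_95 = 1.96  # assumed critical value for a 95% normal-approximation interval


def mean_ci(scores, z=Z_95):
    """Return (mean, sd, lower, upper) for a symmetric interval around the mean."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    half_width = z * sd / math.sqrt(len(scores))
    return mean, sd, mean - half_width, mean + half_width


def clip_unit(value):
    """Clamp a bound to the range a proportion can actually take."""
    return min(1.0, max(0.0, value))


# Invented per-test-set specificity scores clustered near 1 (not real results).
scores = [1.0, 1.0, 1.0, 0.98, 1.0, 1.0, 0.99, 1.0]
mean, sd, lower, upper = mean_ci(scores)
print(f"mean={mean:.4f} sd={sd:.4f} CI=({lower:.4f}, {upper:.4f})")
print(f"clipped upper bound: {clip_unit(upper):.4f}")
```

Under this reading, a bound above 1 is an artefact of the approximation rather than an error in the data, and can be read as 1.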
The following plots present the performance of each classifier on each test set for five statistics: specificity (figure 6.15), sensitivity (figure 6.16), precision (figure 6.17), F1-score (figure 6.18), and accuracy (figure 6.19).
Figure 6.15: One dimensional scatter plot of the specificity per test set for the classification of GT class members, overlaying a box plot
Figure 6.16: One dimensional scatter plot of the sensitivity per test set for the classification of GT class members, overlaying a box plot
Figure 6.17: One dimensional scatter plot of the precision per test set for the classification of GT class members, overlaying a box plot
Figure 6.18: One dimensional scatter plot of the F1-score per test set for the classification of GT class members, overlaying a box plot
Figure 6.19: One dimensional scatter plot of the accuracy per test set for the classification of GT class members, overlaying a box plot
Figure 6.20 plots the difference in performance between bacterial and eukaryotic PL class members. Most classifiers showed a strong consistency in performance between the bacterial and eukaryotic test sets (as inferred from the small interquartile ranges), except eCAMI, which showed a significantly greater range in performance when classifying bacterial proteins.
Figure 6.20: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying PL class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
The following tables summarise the performance of each classifier across all test sets, first for each taxonomic group (bacteria, then eukaryota) and then with all test sets pooled (assigned the taxonomic group ‘All’).
| Classifier | Mean Bacteria Specificity | Bacteria Specificity Standard Deviation | Bacteria Specificity Lower CI | Bacteria Specificity Upper CI | Mean Bacteria Sensitivity | Bacteria Sensitivity Standard Deviation | Bacteria Sensitivity Lower CI | Bacteria Sensitivity Upper CI | Mean Bacteria Precision | Bacteria Precision Standard Deviation | Bacteria Precision Lower CI | Bacteria Precision Upper CI | Mean Bacteria F1-score | Bacteria F1-score Standard Deviation | Bacteria F1-score Lower CI | Bacteria F1-score Upper CI | Mean Bacteria Accuracy | Bacteria Accuracy Standard Deviation | Bacteria Accuracy Lower CI | Bacteria Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9996 | 0.0020 | 0.9988 | 1.0004 | 0.7233 | 0.3707 | 0.5767 | 0.8700 | 0.8466 | 0.3608 | 0.7038 | 0.9893 | 0.7637 | 0.3562 | 0.6228 | 0.9046 | 0.9900 | 0.0162 | 0.9836 | 0.9964 |
| dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.8210 | 0.3033 | 0.7011 | 0.9410 | 0.9259 | 0.2669 | 0.8204 | 1.0315 | 0.8572 | 0.2825 | 0.7454 | 0.9690 | 0.9941 | 0.0089 | 0.9906 | 0.9976 |
| DIAMOND | 0.9993 | 0.0027 | 0.9982 | 1.0003 | 0.8792 | 0.2539 | 0.7788 | 0.9796 | 0.9392 | 0.2122 | 0.8552 | 1.0231 | 0.8921 | 0.2297 | 0.8012 | 0.9829 | 0.9952 | 0.0075 | 0.9923 | 0.9982 |
| eCAMI | 0.9989 | 0.0032 | 0.9977 | 1.0002 | 0.7129 | 0.3258 | 0.5866 | 0.8393 | 0.8798 | 0.3141 | 0.7580 | 1.0015 | 0.7725 | 0.3077 | 0.6532 | 0.8918 | 0.9880 | 0.0173 | 0.9813 | 0.9947 |
| HMMER | 0.9992 | 0.0039 | 0.9977 | 1.0008 | 0.8487 | 0.2827 | 0.7368 | 0.9605 | 0.9074 | 0.2786 | 0.7972 | 1.0176 | 0.8709 | 0.2745 | 0.7623 | 0.9795 | 0.9945 | 0.0089 | 0.9910 | 0.9980 |
| Hotpep | 0.9988 | 0.0046 | 0.9969 | 1.0006 | 0.8035 | 0.2913 | 0.6883 | 0.9188 | 0.9145 | 0.2681 | 0.8084 | 1.0205 | 0.8439 | 0.2678 | 0.7379 | 0.9498 | 0.9904 | 0.0167 | 0.9838 | 0.9971 |
| Classifier | Mean Eukaryote Specificity | Eukaryote Specificity Standard Deviation | Eukaryote Specificity Lower CI | Eukaryote Specificity Upper CI | Mean Eukaryote Sensitivity | Eukaryote Sensitivity Standard Deviation | Eukaryote Sensitivity Lower CI | Eukaryote Sensitivity Upper CI | Mean Eukaryote Precision | Eukaryote Precision Standard Deviation | Eukaryote Precision Lower CI | Eukaryote Precision Upper CI | Mean Eukaryote F1-score | Eukaryote F1-score Standard Deviation | Eukaryote F1-score Lower CI | Eukaryote F1-score Upper CI | Mean Eukaryote Accuracy | Eukaryote Accuracy Standard Deviation | Eukaryote Accuracy Lower CI | Eukaryote Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9982 | 0.0040 | 0.9955 | 1.0009 | 0.9021 | 0.3001 | 0.7005 | 1.1037 | 0.8561 | 0.3075 | 0.6495 | 1.0627 | 0.8743 | 0.2980 | 0.6741 | 1.0745 | 0.9964 | 0.0050 | 0.9931 | 0.9998 |
| dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.9557 | 0.1064 | 0.8842 | 1.0272 | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.9742 | 0.0630 | 0.9319 | 1.0165 | 0.9974 | 0.0064 | 0.9931 | 1.0016 |
| DIAMOND | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.8951 | 0.3004 | 0.6933 | 1.0969 | 0.9091 | 0.3015 | 0.7065 | 1.1116 | 0.9015 | 0.3000 | 0.6999 | 1.1031 | 0.9973 | 0.0065 | 0.9929 | 1.0016 |
| eCAMI | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.8610 | 0.2982 | 0.6607 | 1.0614 | 0.9091 | 0.3015 | 0.7065 | 1.1116 | 0.8825 | 0.2966 | 0.6832 | 1.0817 | 0.9955 | 0.0068 | 0.9909 | 1.0001 |
| HMMER | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.9860 | 0.0464 | 0.9549 | 1.0172 | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.9924 | 0.0251 | 0.9755 | 1.0093 | 0.9982 | 0.0060 | 0.9941 | 1.0022 |
| Hotpep | 0.9979 | 0.0069 | 0.9933 | 1.0025 | 0.8648 | 0.3056 | 0.6595 | 1.0701 | 0.8951 | 0.3004 | 0.6933 | 1.0969 | 0.8769 | 0.2995 | 0.6757 | 1.0781 | 0.9947 | 0.0120 | 0.9866 | 1.0027 |
| Classifier | Mean All Specificity | All Specificity Standard Deviation | All Specificity Lower CI | All Specificity Upper CI | Mean All Sensitivity | All Sensitivity Standard Deviation | All Sensitivity Lower CI | All Sensitivity Upper CI | Mean All Precision | All Precision Standard Deviation | All Precision Lower CI | All Precision Upper CI | Mean All F1-score | All F1-score Standard Deviation | All F1-score Lower CI | All F1-score Upper CI | Mean All Accuracy | All Accuracy Standard Deviation | All Accuracy Lower CI | All Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9992 | 0.0028 | 0.9983 | 1.0001 | 0.7751 | 0.3573 | 0.6576 | 0.8925 | 0.8493 | 0.3421 | 0.7369 | 0.9618 | 0.7957 | 0.3402 | 0.6839 | 0.9075 | 0.9919 | 0.0141 | 0.9872 | 0.9965 |
| dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.8600 | 0.2674 | 0.7721 | 0.9479 | 0.9474 | 0.2263 | 0.8730 | 1.0217 | 0.8911 | 0.2451 | 0.8105 | 0.9716 | 0.9950 | 0.0083 | 0.9923 | 0.9978 |
| DIAMOND | 0.9995 | 0.0023 | 0.9987 | 1.0002 | 0.8838 | 0.2641 | 0.7970 | 0.9706 | 0.9305 | 0.2375 | 0.8524 | 1.0085 | 0.8948 | 0.2479 | 0.8133 | 0.9763 | 0.9958 | 0.0072 | 0.9935 | 0.9982 |
| eCAMI | 0.9992 | 0.0028 | 0.9983 | 1.0001 | 0.7547 | 0.3215 | 0.6505 | 0.8589 | 0.8880 | 0.3069 | 0.7886 | 0.9875 | 0.8035 | 0.3049 | 0.7047 | 0.9023 | 0.9901 | 0.0154 | 0.9851 | 0.9951 |
| HMMER | 0.9995 | 0.0033 | 0.9984 | 1.0006 | 0.8884 | 0.2465 | 0.8074 | 0.9694 | 0.9342 | 0.2374 | 0.8562 | 1.0122 | 0.9061 | 0.2372 | 0.8281 | 0.9840 | 0.9955 | 0.0083 | 0.9928 | 0.9983 |
| Hotpep | 0.9985 | 0.0053 | 0.9968 | 1.0003 | 0.8213 | 0.2927 | 0.7251 | 0.9175 | 0.9089 | 0.2738 | 0.8189 | 0.9989 | 0.8534 | 0.2736 | 0.7635 | 0.9434 | 0.9917 | 0.0155 | 0.9866 | 0.9967 |
The following plots present the performance of each classifier on each test set for five statistics: specificity (figure 6.21), sensitivity (figure 6.22), precision (figure 6.23), F1-score (figure 6.24), and accuracy (figure 6.25).
Figure 6.21: One dimensional scatter plot of the specificity per test set for the classification of PL class members, overlaying a box plot
Figure 6.22: One dimensional scatter plot of the sensitivity per test set for the classification of PL class members, overlaying a box plot
Figure 6.23: One dimensional scatter plot of the precision per test set for the classification of PL class members, overlaying a box plot
Figure 6.24: One dimensional scatter plot of the F1-score per test set for the classification of PL class members, overlaying a box plot
Figure 6.25: One dimensional scatter plot of the accuracy per test set for the classification of PL class members, overlaying a box plot
Figure 6.26 plots the difference in performance between bacterial and eukaryotic CE class members. Most classifiers showed a strong consistency in performance between the bacterial and eukaryotic test sets (as inferred from the small interquartile ranges), except eCAMI, which showed a significantly greater range in performance when classifying bacterial proteins.
Figure 6.26: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying CE class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
The following tables summarise the performance of each classifier across all test sets, first for each taxonomic group (bacteria, then eukaryota) and then with all test sets pooled (assigned the taxonomic group ‘All’).
| Classifier | Mean Bacteria Specificity | Bacteria Specificity Standard Deviation | Bacteria Specificity Lower CI | Bacteria Specificity Upper CI | Mean Bacteria Sensitivity | Bacteria Sensitivity Standard Deviation | Bacteria Sensitivity Lower CI | Bacteria Sensitivity Upper CI | Mean Bacteria Precision | Bacteria Precision Standard Deviation | Bacteria Precision Lower CI | Bacteria Precision Upper CI | Mean Bacteria F1-score | Bacteria F1-score Standard Deviation | Bacteria F1-score Lower CI | Bacteria F1-score Upper CI | Mean Bacteria Accuracy | Bacteria Accuracy Standard Deviation | Bacteria Accuracy Lower CI | Bacteria Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9963 | 0.0108 | 0.9927 | 0.9998 | 0.9565 | 0.0721 | 0.9331 | 0.9798 | 0.9623 | 0.1048 | 0.9283 | 0.9963 | 0.9549 | 0.0734 | 0.9311 | 0.9787 | 0.9932 | 0.0114 | 0.9895 | 0.9969 |
| dbCAN | 0.9929 | 0.0207 | 0.9862 | 0.9996 | 0.9841 | 0.0681 | 0.9620 | 1.0062 | 0.9432 | 0.1501 | 0.8945 | 0.9918 | 0.9536 | 0.1082 | 0.9185 | 0.9887 | 0.9924 | 0.0195 | 0.9860 | 0.9987 |
| DIAMOND | 0.9927 | 0.0215 | 0.9857 | 0.9997 | 0.9279 | 0.1857 | 0.8678 | 0.9881 | 0.9416 | 0.1551 | 0.8913 | 0.9918 | 0.9122 | 0.1679 | 0.8578 | 0.9667 | 0.9894 | 0.0224 | 0.9821 | 0.9966 |
| eCAMI | 0.9902 | 0.0209 | 0.9835 | 0.9970 | 0.9172 | 0.1432 | 0.8708 | 0.9637 | 0.9070 | 0.1585 | 0.8557 | 0.9584 | 0.8980 | 0.1371 | 0.8535 | 0.9424 | 0.9858 | 0.0212 | 0.9789 | 0.9927 |
| HMMER | 0.9965 | 0.0103 | 0.9932 | 0.9998 | 0.9180 | 0.1368 | 0.8736 | 0.9623 | 0.9661 | 0.0922 | 0.9362 | 0.9960 | 0.9314 | 0.0936 | 0.9011 | 0.9618 | 0.9924 | 0.0106 | 0.9890 | 0.9959 |
| Hotpep | 0.9885 | 0.0215 | 0.9815 | 0.9954 | 0.9763 | 0.0717 | 0.9530 | 0.9995 | 0.8965 | 0.1683 | 0.8419 | 0.9511 | 0.9251 | 0.1223 | 0.8855 | 0.9648 | 0.9872 | 0.0218 | 0.9802 | 0.9943 |
| Classifier | Mean Eukaryote Specificity | Eukaryote Specificity Standard Deviation | Eukaryote Specificity Lower CI | Eukaryote Specificity Upper CI | Mean Eukaryote Sensitivity | Eukaryote Sensitivity Standard Deviation | Eukaryote Sensitivity Lower CI | Eukaryote Sensitivity Upper CI | Mean Eukaryote Precision | Eukaryote Precision Standard Deviation | Eukaryote Precision Lower CI | Eukaryote Precision Upper CI | Mean Eukaryote F1-score | Eukaryote F1-score Standard Deviation | Eukaryote F1-score Lower CI | Eukaryote F1-score Upper CI | Mean Eukaryote Accuracy | Eukaryote Accuracy Standard Deviation | Eukaryote Accuracy Lower CI | Eukaryote Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9996 | 0.0019 | 0.9989 | 1.0004 | 0.9055 | 0.2143 | 0.8224 | 0.9886 | 0.9583 | 0.1904 | 0.8845 | 1.0322 | 0.9262 | 0.1966 | 0.8500 | 1.0025 | 0.9965 | 0.0055 | 0.9943 | 0.9986 |
| dbCAN | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.9375 | 0.2111 | 0.8556 | 1.0194 | 0.9643 | 0.1890 | 0.8910 | 1.0376 | 0.9473 | 0.1975 | 0.8707 | 1.0239 | 0.9982 | 0.0055 | 0.9961 | 1.0003 |
| DIAMOND | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.9026 | 0.2673 | 0.7990 | 1.0063 | 0.9286 | 0.2623 | 0.8269 | 1.0303 | 0.9136 | 0.2623 | 0.8119 | 1.0153 | 0.9968 | 0.0086 | 0.9935 | 1.0002 |
| eCAMI | 0.9996 | 0.0020 | 0.9988 | 1.0004 | 0.7314 | 0.3484 | 0.5963 | 0.8666 | 0.8884 | 0.3143 | 0.7665 | 1.0103 | 0.7808 | 0.3228 | 0.6556 | 0.9060 | 0.9922 | 0.0099 | 0.9884 | 0.9961 |
| HMMER | 0.9993 | 0.0027 | 0.9982 | 1.0003 | 0.9929 | 0.0378 | 0.9782 | 1.0075 | 0.9869 | 0.0483 | 0.9682 | 1.0056 | 0.9888 | 0.0330 | 0.9760 | 1.0016 | 0.9989 | 0.0031 | 0.9977 | 1.0001 |
| Hotpep | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.7806 | 0.3181 | 0.6572 | 0.9039 | 0.8929 | 0.3150 | 0.7707 | 1.0150 | 0.8247 | 0.3081 | 0.7052 | 0.9441 | 0.9930 | 0.0084 | 0.9897 | 0.9962 |
| Classifier | Mean All Specificity | All Specificity Standard Deviation | All Specificity Lower CI | All Specificity Upper CI | Mean All Sensitivity | All Sensitivity Standard Deviation | All Sensitivity Lower CI | All Sensitivity Upper CI | Mean All Precision | All Precision Standard Deviation | All Precision Lower CI | All Precision Upper CI | Mean All F1-score | All F1-score Standard Deviation | All F1-score Lower CI | All F1-score Upper CI | Mean All Accuracy | All Accuracy Standard Deviation | All Accuracy Lower CI | All Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9977 | 0.0085 | 0.9956 | 0.9997 | 0.9352 | 0.1498 | 0.8986 | 0.9717 | 0.9606 | 0.1455 | 0.9252 | 0.9961 | 0.9429 | 0.1383 | 0.9092 | 0.9766 | 0.9946 | 0.0095 | 0.9923 | 0.9969 |
| dbCAN | 0.9959 | 0.0161 | 0.9920 | 0.9998 | 0.9646 | 0.1464 | 0.9289 | 1.0003 | 0.9520 | 0.1664 | 0.9114 | 0.9926 | 0.9510 | 0.1507 | 0.9142 | 0.9877 | 0.9948 | 0.0155 | 0.9910 | 0.9986 |
| DIAMOND | 0.9958 | 0.0167 | 0.9917 | 0.9998 | 0.9174 | 0.2219 | 0.8632 | 0.9715 | 0.9361 | 0.2050 | 0.8861 | 0.9861 | 0.9128 | 0.2107 | 0.8614 | 0.9642 | 0.9925 | 0.0182 | 0.9880 | 0.9969 |
| eCAMI | 0.9941 | 0.0166 | 0.9901 | 0.9982 | 0.8396 | 0.2646 | 0.7751 | 0.9041 | 0.8992 | 0.2344 | 0.8421 | 0.9564 | 0.8490 | 0.2384 | 0.7909 | 0.9072 | 0.9885 | 0.0176 | 0.9842 | 0.9928 |
| HMMER | 0.9977 | 0.0081 | 0.9957 | 0.9996 | 0.9493 | 0.1129 | 0.9217 | 0.9768 | 0.9748 | 0.0772 | 0.9560 | 0.9936 | 0.9554 | 0.0794 | 0.9360 | 0.9748 | 0.9952 | 0.0089 | 0.9930 | 0.9973 |
| Hotpep | 0.9933 | 0.0173 | 0.9891 | 0.9975 | 0.8945 | 0.2320 | 0.8379 | 0.9511 | 0.8950 | 0.2385 | 0.8368 | 0.9532 | 0.8832 | 0.2235 | 0.8286 | 0.9377 | 0.9896 | 0.0176 | 0.9853 | 0.9939 |
The following plots present the performance of each classifier on each test set for five statistics: specificity (figure 6.27), sensitivity (figure 6.28), precision (figure 6.29), F1-score (figure 6.30), and accuracy (figure 6.31).
Figure 6.27: One dimensional scatter plot of the specificity per test set for the classification of CE class members, overlaying a box plot
Figure 6.28: One dimensional scatter plot of the sensitivity per test set for the classification of CE class members, overlaying a box plot
Figure 6.29: One dimensional scatter plot of the precision per test set for the classification of CE class members, overlaying a box plot
Figure 6.30: One dimensional scatter plot of the F1-score per test set for the classification of CE class members, overlaying a box plot
Figure 6.31: One dimensional scatter plot of the accuracy per test set for the classification of CE class members, overlaying a box plot
Figure 6.32 plots the difference in performance between bacterial and eukaryotic AA class members. As inferred from the interquartile ranges, all classifiers demonstrate a more consistent performance against bacterial than eukaryotic AA class members. However, this is most likely because the AA class predominantly contains eukaryotic proteins. It is therefore relatively ‘easier’ for a classifier to determine that a bacterial protein does not belong to the class, because there is low sequence similarity between bacterial proteins and the representative models of the AA class, which over-represent eukaryotic proteins. Additionally, with fewer bacterial AA proteins there are fewer opportunities for a classifier to misclassify an AA member as a non-AA member. This results in a consistently higher F1-score for bacteria than for eukaryotes, which offer many opportunities for misclassifying AA members as non-AA members.
Figure 6.32: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying AA class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
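The effect of class size on the F1-score can be illustrated with a toy calculation. F1 depends only on true positives, false positives and false negatives, so a test set with few or no AA members gives a classifier little room to lose score. The sketch below assumes, without confirmation from the source, that a test set with no AA members and no AA predictions is scored as 1.0; the `f1_score` helper, its `undefined` parameter and all counts are hypothetical.

```python
def f1_score(tp, fp, fn, undefined=1.0):
    """Return F1 from counts; `undefined` is used when tp, fp and fn are all zero."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return undefined  # assumed convention, not confirmed by the source
    return 2 * tp / denominator


# Invented (tp, fp, fn) counts per test set, for illustration only.
few_members = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (2, 0, 0)]
many_members = [(9, 1, 1), (7, 0, 3), (10, 2, 0), (6, 1, 4)]

f1_few = [f1_score(*counts) for counts in few_members]
f1_many = [f1_score(*counts) for counts in many_members]

print("few members :", [round(score, 3) for score in f1_few])
print("many members:", [round(score, 3) for score in f1_many])
```

With few members, every test set scores the same value and the spread collapses to zero, whereas test sets with many members spread out as soon as any are missed.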
The following tables summarise the performance of each classifier across all test sets, first for each taxonomic group (bacteria, then eukaryota) and then with all test sets pooled (assigned the taxonomic group ‘All’).
| Classifier | Mean Bacteria Specificity | Bacteria Specificity Standard Deviation | Bacteria Specificity Lower CI | Bacteria Specificity Upper CI | Mean Bacteria Sensitivity | Bacteria Sensitivity Standard Deviation | Bacteria Sensitivity Lower CI | Bacteria Sensitivity Upper CI | Mean Bacteria Precision | Bacteria Precision Standard Deviation | Bacteria Precision Lower CI | Bacteria Precision Upper CI | Mean Bacteria F1-score | Bacteria F1-score Standard Deviation | Bacteria F1-score Lower CI | Bacteria F1-score Upper CI | Mean Bacteria Accuracy | Bacteria Accuracy Standard Deviation | Bacteria Accuracy Lower CI | Bacteria Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| dbCAN | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| DIAMOND | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| eCAMI | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| HMMER | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| Hotpep | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 |
| Classifier | Mean Eukaryote Specificity | Eukaryote Specificity Standard Deviation | Eukaryote Specificity Lower CI | Eukaryote Specificity Upper CI | Mean Eukaryote Sensitivity | Eukaryote Sensitivity Standard Deviation | Eukaryote Sensitivity Lower CI | Eukaryote Sensitivity Upper CI | Mean Eukaryote Precision | Eukaryote Precision Standard Deviation | Eukaryote Precision Lower CI | Eukaryote Precision Upper CI | Mean Eukaryote F1-score | Eukaryote F1-score Standard Deviation | Eukaryote F1-score Lower CI | Eukaryote F1-score Upper CI | Mean Eukaryote Accuracy | Eukaryote Accuracy Standard Deviation | Eukaryote Accuracy Lower CI | Eukaryote Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9905 | 0.0214 | 0.9820 | 0.9989 | 0.8855 | 0.1292 | 0.8344 | 0.9366 | 0.9154 | 0.1703 | 0.8480 | 0.9828 | 0.8861 | 0.1310 | 0.8343 | 0.9380 | 0.9811 | 0.0274 | 0.9702 | 0.9919 |
| dbCAN | 0.9905 | 0.0226 | 0.9815 | 0.9994 | 0.9139 | 0.1271 | 0.8637 | 0.9642 | 0.9164 | 0.1699 | 0.8492 | 0.9836 | 0.9033 | 0.1368 | 0.8492 | 0.9574 | 0.9844 | 0.0283 | 0.9732 | 0.9956 |
| DIAMOND | 0.9904 | 0.0222 | 0.9816 | 0.9992 | 0.8351 | 0.2779 | 0.7251 | 0.9450 | 0.8826 | 0.2391 | 0.7880 | 0.9771 | 0.8278 | 0.2507 | 0.7286 | 0.9270 | 0.9825 | 0.0248 | 0.9727 | 0.9923 |
| eCAMI | 0.9912 | 0.0193 | 0.9836 | 0.9988 | 0.7837 | 0.1955 | 0.7064 | 0.8611 | 0.9142 | 0.1712 | 0.8465 | 0.9819 | 0.8190 | 0.1560 | 0.7573 | 0.8807 | 0.9751 | 0.0293 | 0.9635 | 0.9867 |
| HMMER | 0.9897 | 0.0213 | 0.9812 | 0.9981 | 0.9550 | 0.0756 | 0.9251 | 0.9849 | 0.9102 | 0.1654 | 0.8448 | 0.9756 | 0.9217 | 0.1183 | 0.8749 | 0.9685 | 0.9850 | 0.0244 | 0.9754 | 0.9947 |
| Hotpep | 0.9901 | 0.0230 | 0.9810 | 0.9992 | 0.8938 | 0.1446 | 0.8365 | 0.9510 | 0.9137 | 0.1749 | 0.8445 | 0.9829 | 0.8890 | 0.1426 | 0.8326 | 0.9454 | 0.9826 | 0.0287 | 0.9712 | 0.9939 |
| Classifier | Mean All Specificity | All Specificity Standard Deviation | All Specificity Lower CI | All Specificity Upper CI | Mean All Sensitivity | All Sensitivity Standard Deviation | All Sensitivity Lower CI | All Sensitivity Upper CI | Mean All Precision | All Precision Standard Deviation | All Precision Lower CI | All Precision Upper CI | Mean All F1-score | All F1-score Standard Deviation | All F1-score Lower CI | All F1-score Upper CI | Mean All Accuracy | All Accuracy Standard Deviation | All Accuracy Lower CI | All Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9930 | 0.0187 | 0.9868 | 0.9993 | 0.9165 | 0.1213 | 0.8760 | 0.9569 | 0.9383 | 0.1497 | 0.8884 | 0.9882 | 0.9169 | 0.1226 | 0.8760 | 0.9578 | 0.9862 | 0.0248 | 0.9779 | 0.9945 |
| dbCAN | 0.9930 | 0.0196 | 0.9865 | 0.9996 | 0.9372 | 0.1147 | 0.8989 | 0.9754 | 0.9390 | 0.1492 | 0.8892 | 0.9887 | 0.9294 | 0.1241 | 0.8881 | 0.9708 | 0.9886 | 0.0251 | 0.9803 | 0.9970 |
| DIAMOND | 0.9930 | 0.0194 | 0.9866 | 0.9995 | 0.8796 | 0.2475 | 0.7971 | 0.9622 | 0.9143 | 0.2099 | 0.8443 | 0.9843 | 0.8743 | 0.2267 | 0.7987 | 0.9499 | 0.9872 | 0.0225 | 0.9797 | 0.9947 |
| eCAMI | 0.9936 | 0.0169 | 0.9880 | 0.9992 | 0.8422 | 0.1926 | 0.7780 | 0.9064 | 0.9374 | 0.1505 | 0.8872 | 0.9876 | 0.8679 | 0.1556 | 0.8160 | 0.9198 | 0.9818 | 0.0273 | 0.9727 | 0.9909 |
| HMMER | 0.9925 | 0.0187 | 0.9862 | 0.9987 | 0.9671 | 0.0673 | 0.9447 | 0.9896 | 0.9345 | 0.1462 | 0.8857 | 0.9832 | 0.9429 | 0.1066 | 0.9073 | 0.9784 | 0.9891 | 0.0218 | 0.9818 | 0.9963 |
| Hotpep | 0.9928 | 0.0201 | 0.9861 | 0.9995 | 0.9225 | 0.1319 | 0.8785 | 0.9664 | 0.9370 | 0.1536 | 0.8858 | 0.9883 | 0.9190 | 0.1311 | 0.8753 | 0.9627 | 0.9873 | 0.0256 | 0.9788 | 0.9958 |
The following plots present the performance of each classifier on each test set for five statistics: specificity (figure 6.33), sensitivity (figure 6.34), precision (figure 6.35), F1-score (figure 6.36), and accuracy (figure 6.37).
Figure 6.33: One dimensional scatter plot of the specificity per test set for the classification of AA class members, overlaying a box plot
Figure 6.34: One dimensional scatter plot of the sensitivity per test set for the classification of AA class members, overlaying a box plot
Figure 6.35: One dimensional scatter plot of the precision per test set for the classification of AA class members, overlaying a box plot
Figure 6.36: One dimensional scatter plot of the F1-score per test set for the classification of AA class members, overlaying a box plot
Figure 6.37: One dimensional scatter plot of the accuracy per test set for the classification of AA class members, overlaying a box plot
Figure 6.38 plots the difference in performance between bacterial and eukaryotic CBM class members. Most classifiers demonstrated a greater variation in performance against eukaryotic than bacterial proteins, which may be the result of greater sequence diversity within the eukaryotic CBMs than within the bacterial CBMs.
Figure 6.38: One dimensional scatter plot overlaying a box and whisker plot of the F1-score of classifying CBM class members for CAZyme classifiers, when parsing data from bacterial, eukaryote or both (identified as ‘all’) kingdoms. One point on the scatter plot represents the F1-score for one test set.
The following tables summarise the performance of each classifier across all test sets, first for each taxonomic group (bacteria, then eukaryota) and then with all test sets pooled (assigned the taxonomic group ‘All’).
| Classifier | Mean Bacteria Specificity | Bacteria Specificity Standard Deviation | Bacteria Specificity Lower CI | Bacteria Specificity Upper CI | Mean Bacteria Sensitivity | Bacteria Sensitivity Standard Deviation | Bacteria Sensitivity Lower CI | Bacteria Sensitivity Upper CI | Mean Bacteria Precision | Bacteria Precision Standard Deviation | Bacteria Precision Lower CI | Bacteria Precision Upper CI | Mean Bacteria F1-score | Bacteria F1-score Standard Deviation | Bacteria F1-score Lower CI | Bacteria F1-score Upper CI | Mean Bacteria Accuracy | Bacteria Accuracy Standard Deviation | Bacteria Accuracy Lower CI | Bacteria Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.8684 | 0.1136 | 0.8316 | 0.9052 |
| dbCAN | 0.9937 | 0.0112 | 0.9901 | 0.9974 | 0.8371 | 0.1957 | 0.7736 | 0.9005 | 0.9276 | 0.1309 | 0.8851 | 0.9700 | 0.8643 | 0.1577 | 0.8132 | 0.9155 | 0.9754 | 0.0288 | 0.9661 | 0.9847 |
| DIAMOND | 0.9949 | 0.0117 | 0.9912 | 0.9987 | 0.8664 | 0.2074 | 0.7991 | 0.9336 | 0.9511 | 0.1653 | 0.8975 | 1.0047 | 0.8986 | 0.1810 | 0.8399 | 0.9573 | 0.9811 | 0.0256 | 0.9728 | 0.9894 |
| eCAMI | 0.9325 | 0.0594 | 0.9132 | 0.9518 | 0.8460 | 0.2300 | 0.7715 | 0.9206 | 0.6447 | 0.1787 | 0.5867 | 0.7026 | 0.7118 | 0.1856 | 0.6516 | 0.7719 | 0.9253 | 0.0579 | 0.9066 | 0.9441 |
| HMMER | 0.9948 | 0.0097 | 0.9917 | 0.9980 | 0.5664 | 0.2516 | 0.4849 | 0.6480 | 0.9243 | 0.1432 | 0.8779 | 0.9707 | 0.6602 | 0.1984 | 0.5958 | 0.7245 | 0.9468 | 0.0327 | 0.9362 | 0.9574 |
| Hotpep | 0.8869 | 0.0638 | 0.8662 | 0.9076 | 0.8210 | 0.2358 | 0.7445 | 0.8974 | 0.4834 | 0.1837 | 0.4239 | 0.5430 | 0.5902 | 0.1900 | 0.5286 | 0.6518 | 0.8812 | 0.0629 | 0.8608 | 0.9016 |
| Classifier | Mean Eukaryote Specificity | Eukaryote Specificity Standard Deviation | Eukaryote Specificity Lower CI | Eukaryote Specificity Upper CI | Mean Eukaryote Sensitivity | Eukaryote Sensitivity Standard Deviation | Eukaryote Sensitivity Lower CI | Eukaryote Sensitivity Upper CI | Mean Eukaryote Precision | Eukaryote Precision Standard Deviation | Eukaryote Precision Lower CI | Eukaryote Precision Upper CI | Mean Eukaryote F1-score | Eukaryote F1-score Standard Deviation | Eukaryote F1-score Lower CI | Eukaryote F1-score Upper CI | Mean Eukaryote Accuracy | Eukaryote Accuracy Standard Deviation | Eukaryote Accuracy Lower CI | Eukaryote Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.9063 | 0.0369 | 0.8928 | 0.9199 |
| dbCAN | 0.9937 | 0.0093 | 0.9903 | 0.9971 | 0.7549 | 0.1932 | 0.6840 | 0.8258 | 0.9227 | 0.1175 | 0.8796 | 0.9658 | 0.8169 | 0.1492 | 0.7621 | 0.8716 | 0.9698 | 0.0253 | 0.9605 | 0.9790 |
| DIAMOND | 0.9944 | 0.0079 | 0.9915 | 0.9973 | 0.8653 | 0.2010 | 0.7916 | 0.9390 | 0.9327 | 0.0997 | 0.8961 | 0.9692 | 0.8847 | 0.1498 | 0.8297 | 0.9396 | 0.9831 | 0.0208 | 0.9755 | 0.9908 |
| eCAMI | 0.9679 | 0.0292 | 0.9572 | 0.9786 | 0.7682 | 0.2024 | 0.6939 | 0.8424 | 0.7330 | 0.1893 | 0.6636 | 0.8025 | 0.7344 | 0.1657 | 0.6736 | 0.7952 | 0.9481 | 0.0385 | 0.9339 | 0.9622 |
| HMMER | 0.9975 | 0.0055 | 0.9955 | 0.9996 | 0.3410 | 0.1684 | 0.2792 | 0.4027 | 0.8849 | 0.2732 | 0.7847 | 0.9852 | 0.4773 | 0.2055 | 0.4019 | 0.5527 | 0.9360 | 0.0334 | 0.9237 | 0.9482 |
| Hotpep | 0.9194 | 0.0398 | 0.9048 | 0.9340 | 0.7399 | 0.1994 | 0.6668 | 0.8131 | 0.4898 | 0.1365 | 0.4397 | 0.5399 | 0.5723 | 0.1282 | 0.5253 | 0.6193 | 0.9008 | 0.0446 | 0.8844 | 0.9171 |
| Classifier | Mean All Specificity | All Specificity Standard Deviation | All Specificity Lower CI | All Specificity Upper CI | Mean All Sensitivity | All Sensitivity Standard Deviation | All Sensitivity Lower CI | All Sensitivity Upper CI | Mean All Precision | All Precision Standard Deviation | All Precision Lower CI | All Precision Upper CI | Mean All F1-score | All F1-score Standard Deviation | All F1-score Lower CI | All F1-score Upper CI | Mean All Accuracy | All Accuracy Standard Deviation | All Accuracy Lower CI | All Accuracy Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.8852 | 0.0898 | 0.8638 | 0.9066 |
| dbCAN | 0.9937 | 0.0103 | 0.9912 | 0.9962 | 0.8007 | 0.1975 | 0.7536 | 0.8478 | 0.9254 | 0.1243 | 0.8958 | 0.9551 | 0.8433 | 0.1547 | 0.8064 | 0.8802 | 0.9729 | 0.0272 | 0.9664 | 0.9794 |
| DIAMOND | 0.9947 | 0.0101 | 0.9923 | 0.9971 | 0.8659 | 0.2031 | 0.8175 | 0.9143 | 0.9429 | 0.1395 | 0.9097 | 0.9762 | 0.8924 | 0.1669 | 0.8526 | 0.9322 | 0.9820 | 0.0235 | 0.9764 | 0.9876 |
| eCAMI | 0.9482 | 0.0513 | 0.9359 | 0.9604 | 0.8116 | 0.2202 | 0.7591 | 0.8641 | 0.6838 | 0.1874 | 0.6391 | 0.7285 | 0.7218 | 0.1762 | 0.6798 | 0.7638 | 0.9354 | 0.0512 | 0.9232 | 0.9476 |
| HMMER | 0.9960 | 0.0082 | 0.9941 | 0.9980 | 0.4666 | 0.2448 | 0.4082 | 0.5250 | 0.9069 | 0.2101 | 0.8568 | 0.9570 | 0.5792 | 0.2200 | 0.5267 | 0.6316 | 0.9420 | 0.0332 | 0.9341 | 0.9499 |
| Hotpep | 0.9013 | 0.0565 | 0.8878 | 0.9148 | 0.7851 | 0.2226 | 0.7320 | 0.8381 | 0.4862 | 0.1634 | 0.4473 | 0.5252 | 0.5823 | 0.1646 | 0.5430 | 0.6215 | 0.8898 | 0.0560 | 0.8765 | 0.9032 |
The following plots present the performance for each classifier for each test set for the following performance statistics: specificity (figure 6.39), sensitivity (6.40), precision (6.41), F1-score (6.42), and accuracy (6.43).
Figure 6.39: One dimensional scatter plot of the specificity per test set for the classification of CBM class members, overlaying a box plot
Figure 6.40: One dimensional scatter plot of the sensitivity per test set for the classification of CBM class members, overlaying a box plot
Figure 6.41: One dimensional scatter plot of the precision per test set for the classification of CBM class members, overlaying a box plot
Figure 6.42: One dimensional scatter plot of the F1-score per test set for the classification of CBM class members, overlaying a box plot
Figure 6.43: One dimensional scatter plot of the accuracy per test set for the classification of CBM class members, overlaying a box plot
To represent the overall CAZy class classification performance while accounting for the multilabel nature of CAZy class annotation, the Rand Index was calculated for each taxonomic group and each CAZyme classifier.
| Prediction_tool | Bact Mean | Bact Standard Deviation | Bact Lower CI | Bact Upper CI | Euk Mean | Euk Standard Deviation | Euk Lower CI | Euk Upper CI | All Mean | All Standard Deviation | All Lower CI | All Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9615 | 0.1074 | 0.9591 | 0.9639 | 0.9637 | 0.1044 | 0.9611 | 0.9663 | 0.9625 | 0.1061 | 0.9611 | 0.9663 |
| dbCAN | 0.9802 | 0.0794 | 0.9785 | 0.9820 | 0.9781 | 0.0832 | 0.9760 | 0.9801 | 0.9793 | 0.0811 | 0.9760 | 0.9801 |
| DIAMOND | 0.9845 | 0.0711 | 0.9830 | 0.9861 | 0.9844 | 0.0710 | 0.9826 | 0.9862 | 0.9845 | 0.0711 | 0.9826 | 0.9862 |
| eCAMI | 0.9674 | 0.1008 | 0.9652 | 0.9697 | 0.9630 | 0.1064 | 0.9604 | 0.9657 | 0.9655 | 0.1034 | 0.9604 | 0.9657 |
| HMMER | 0.9725 | 0.0926 | 0.9704 | 0.9745 | 0.9750 | 0.0884 | 0.9728 | 0.9772 | 0.9736 | 0.0908 | 0.9728 | 0.9772 |
| Hotpep | 0.9495 | 0.1217 | 0.9468 | 0.9522 | 0.9533 | 0.1170 | 0.9504 | 0.9562 | 0.9512 | 0.1197 | 0.9504 | 0.9562 |
The Adjusted Rand Index was also calculated to account for agreement expected by chance.
| Prediction_tool | Bact Mean | Bact Standard Deviation | Bact Lower CI | Bact Upper CI | Euk Mean | Euk Standard Deviation | Euk Lower CI | Euk Upper CI | All Mean | All Standard Deviation | All Lower CI | All Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9004 | 0.2829 | 0.8941 | 0.9066 | 0.9011 | 0.2880 | 0.8939 | 0.9083 | 0.9007 | 0.2852 | 0.8939 | 0.9083 |
| dbCAN | 0.9427 | 0.2304 | 0.9376 | 0.9478 | 0.9361 | 0.2426 | 0.9301 | 0.9422 | 0.9398 | 0.2359 | 0.9301 | 0.9422 |
| DIAMOND | 0.9546 | 0.2078 | 0.9500 | 0.9592 | 0.9543 | 0.2080 | 0.9491 | 0.9595 | 0.9545 | 0.2079 | 0.9491 | 0.9595 |
| eCAMI | 0.9140 | 0.2691 | 0.9081 | 0.9200 | 0.8958 | 0.3006 | 0.8884 | 0.9033 | 0.9060 | 0.2836 | 0.8884 | 0.9033 |
| HMMER | 0.9225 | 0.2622 | 0.9167 | 0.9284 | 0.9322 | 0.2425 | 0.9262 | 0.9383 | 0.9268 | 0.2537 | 0.9262 | 0.9383 |
| Hotpep | 0.8681 | 0.3222 | 0.8609 | 0.8752 | 0.8739 | 0.3201 | 0.8659 | 0.8818 | 0.8706 | 0.3212 | 0.8659 | 0.8818 |
| Prediction_tool | Bact Mean | Bact Standard Deviation | Bact Lower CI | Bact Upper CI | Euk Mean | Euk Standard Deviation | Euk Lower CI | Euk Upper CI | All Mean | All Standard Deviation | All Lower CI | All Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9994 | 0.0016 | 0.9994 | 0.9995 | 0.9995 | 0.0015 | 0.9994 | 0.9995 | 0.9995 | 0.0015 | 0.9994 | 0.9995 |
| dbCAN | 0.9997 | 0.0011 | 0.9997 | 0.9997 | 0.9997 | 0.0012 | 0.9997 | 0.9997 | 0.9997 | 0.0011 | 0.9997 | 0.9997 |
| DIAMOND | 0.9998 | 0.0010 | 0.9997 | 0.9998 | 0.9998 | 0.0010 | 0.9997 | 0.9998 | 0.9998 | 0.0010 | 0.9998 | 0.9998 |
| eCAMI | 0.9994 | 0.0018 | 0.9994 | 0.9994 | 0.9995 | 0.0015 | 0.9994 | 0.9995 | 0.9994 | 0.0017 | 0.9994 | 0.9995 |
| HMMER | 0.9996 | 0.0014 | 0.9995 | 0.9996 | 0.9996 | 0.0014 | 0.9996 | 0.9996 | 0.9996 | 0.0014 | 0.9996 | 0.9996 |
| Hotpep | 0.9990 | 0.0025 | 0.9989 | 0.9990 | 0.9993 | 0.0019 | 0.9992 | 0.9993 | 0.9991 | 0.0023 | 0.9991 | 0.9991 |
| Prediction_tool | Bact Mean | Bact Standard Deviation | Bact Lower CI | Bact Upper CI | Euk Mean | Euk Standard Deviation | Euk Lower CI | Euk Upper CI | All Mean | All Standard Deviation | All Lower CI | All Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9118 | 0.2654 | 0.9059 | 0.9177 | 0.9073 | 0.2782 | 0.9003 | 0.9142 | 0.9098 | 0.2712 | 0.9053 | 0.9143 |
| dbCAN | 0.9420 | 0.2307 | 0.9369 | 0.9472 | 0.9354 | 0.2422 | 0.9294 | 0.9414 | 0.9391 | 0.2359 | 0.9352 | 0.9430 |
| DIAMOND | 0.9529 | 0.2104 | 0.9482 | 0.9576 | 0.9531 | 0.2105 | 0.9478 | 0.9583 | 0.9530 | 0.2105 | 0.9495 | 0.9565 |
| eCAMI | 0.9148 | 0.2621 | 0.9090 | 0.9207 | 0.8988 | 0.2961 | 0.8914 | 0.9062 | 0.9077 | 0.2778 | 0.9031 | 0.9123 |
| HMMER | 0.9201 | 0.2647 | 0.9142 | 0.9260 | 0.9311 | 0.2430 | 0.9251 | 0.9372 | 0.9250 | 0.2554 | 0.9208 | 0.9292 |
| Hotpep | 0.8715 | 0.3087 | 0.8647 | 0.8784 | 0.8812 | 0.3077 | 0.8735 | 0.8889 | 0.8758 | 0.3083 | 0.8707 | 0.8809 |
Classifiers are often combined rather than used in isolation, with the aim of producing a more accurate overall classifier. dbCAN is one example: it runs the classifiers HMMER, Hotpep and DIAMOND, and the consensus classification of these three tools is taken as the dbCAN output.
Defining new combinations of classifiers may reveal a combination that is more accurate than existing combinations and/or using the tools in isolation.
The following combinations of tools were evaluated:
- HMMER, DIAMOND and CUPP
- HMMER, DIAMOND and eCAMI
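The consensus rule described for dbCAN (a prediction that at least two tools agree upon) can be sketched as a simple vote count. The block below is a minimal illustration only: the function names `consensus_labels` and `is_cazyme`, the `min_votes` parameter and the input layout (one set of labels per tool, for a single protein) are assumptions made for this example and are not taken from the notebook's own code.

```python
from collections import Counter


def consensus_labels(predictions, min_votes=2):
    """Return the labels assigned by at least `min_votes` tools.

    `predictions` maps a tool name to the set of labels (for example
    CAZy families) that the tool assigned to one protein.
    This input layout is hypothetical.
    """
    votes = Counter(
        label for labels in predictions.values() for label in set(labels)
    )
    return {label for label, count in votes.items() if count >= min_votes}


def is_cazyme(predictions, min_votes=2):
    """Call a protein a CAZyme when at least `min_votes` tools assign any label."""
    positive_tools = sum(1 for labels in predictions.values() if labels)
    return positive_tools >= min_votes


# One protein, three tools (hypothetical predictions).
example = {"HMMER": {"GH5"}, "DIAMOND": {"GH5", "CBM1"}, "CUPP": set()}
print(consensus_labels(example))  # labels with two or more votes
print(is_cazyme(example))
```

Swapping the third tool (for example CUPP for eCAMI) only changes the keys of the input dictionary, so the same vote count applies to either recombination.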
Table @ref{sumstatsRecombined} contains the summary statistics for the binary classification of proteins, for the individual and combined classifiers.
| Classifier | Spec Mean | Spec Standard Deviation | Spec Lower CI | Spec Upper CI | Sens Mean | Sens Standard Deviation | Sens Lower CI | Sens Upper CI | Prec Mean | Prec Standard Deviation | Prec Lower CI | Prec Upper CI | F1-score Mean | F1-score Standard Deviation | F1-score Lower CI | F1-score Upper CI | Acc Mean | Acc Standard Deviation | Acc Lower CI | Acc Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9917 | 0.0155 | 0.9891 | 0.9943 | 0.8570 | 0.0822 | 0.8433 | 0.8707 | 0.9908 | 0.0172 | 0.9879 | 0.9936 | 0.9167 | 0.0529 | 0.9078 | 0.9255 | 0.9244 | 0.0416 | 0.9174 | 0.9313 |
| dbCAN | 0.9869 | 0.0244 | 0.9828 | 0.9909 | 0.9087 | 0.1119 | 0.8900 | 0.9274 | 0.9866 | 0.0240 | 0.9826 | 0.9906 | 0.9412 | 0.0793 | 0.9280 | 0.9545 | 0.9478 | 0.0562 | 0.9384 | 0.9572 |
| DIAMOND | 0.9844 | 0.0262 | 0.9800 | 0.9888 | 0.9261 | 0.1293 | 0.9045 | 0.9478 | 0.9847 | 0.0251 | 0.9805 | 0.9889 | 0.9481 | 0.0904 | 0.9329 | 0.9632 | 0.9553 | 0.0639 | 0.9446 | 0.9660 |
| eCAMI | 0.9836 | 0.0256 | 0.9793 | 0.9879 | 0.8610 | 0.1323 | 0.8389 | 0.8831 | 0.9826 | 0.0253 | 0.9784 | 0.9868 | 0.9112 | 0.0865 | 0.8967 | 0.9256 | 0.9223 | 0.0644 | 0.9115 | 0.9331 |
| HMMER | 0.9901 | 0.0162 | 0.9874 | 0.9929 | 0.8831 | 0.0832 | 0.8692 | 0.8970 | 0.9893 | 0.0174 | 0.9864 | 0.9922 | 0.9305 | 0.0611 | 0.9203 | 0.9407 | 0.9366 | 0.0421 | 0.9296 | 0.9437 |
| NA | 0.9837 | 0.0285 | 0.9790 | 0.9885 | 0.9137 | 0.0285 | 0.9090 | 0.9185 | 0.9825 | 0.0306 | 0.9774 | 0.9876 | 0.9469 | 0.0295 | 0.9419 | 0.9518 | 0.9487 | 0.0285 | 0.9440 | 0.9535 |
| NA | 0.9806 | 0.0323 | 0.9752 | 0.9860 | 0.9406 | 0.0323 | 0.9352 | 0.9460 | 0.9798 | 0.0336 | 0.9741 | 0.9854 | 0.9598 | 0.0329 | 0.9543 | 0.9653 | 0.9606 | 0.0323 | 0.9552 | 0.9660 |
| Hotpep | 0.9840 | 0.0256 | 0.9797 | 0.9883 | 0.8189 | 0.1322 | 0.7968 | 0.8410 | 0.9815 | 0.0286 | 0.9767 | 0.9863 | 0.8862 | 0.0914 | 0.8709 | 0.9015 | 0.9014 | 0.0664 | 0.8903 | 0.9125 |
Figure 7.1: Summary statistics of CAZyme classifier performance for binary CAZyme/non-CAZyme prediction, shown as the mean plus and minus the 95% confidence interval.
Figure @ref{RTstatsRecombined} presents the distribution of each statistical parameter used to evaluate the differentiation of CAZymes and non-CAZymes, per CAZyme classifier (including the recombined classifiers).
Figure 7.2: Proportional area plot of the distribution of statistical parameters across all test sets.
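The summary tables report a mean, a standard deviation and a 95% confidence interval per statistic. One way such an interval can be computed is sketched below, assuming a normal approximation around the mean of the per-test-set scores. The function name `mean_ci` is hypothetical, and the notebook may have used a different method (for example a t-distribution), so the numbers will not necessarily match the tables.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(values, confidence=0.95):
    """Return (mean, standard deviation, lower CI, upper CI).

    Uses a normal approximation: mean +/- z * sd / sqrt(n).
    This is an assumed method, not necessarily the notebook's.
    """
    n = len(values)
    m = mean(values)
    sd = stdev(values)  # sample standard deviation
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sd / sqrt(n)
    return m, sd, m - half_width, m + half_width


# Hypothetical per-test-set specificity scores.
scores = [0.90, 1.00, 0.95, 0.85]
m, sd, lower, upper = mean_ci(scores)
print(round(m, 4), round(lower, 4), round(upper, 4))
```

A symmetric interval of this kind is not bounded to the range 0 to 1, so an upper limit slightly above 1 can occur when the mean is close to 1 and the spread is large.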
Specificity is the proportion of known negatives (known non-CAZymes) which are correctly classified as negatives (non-CAZymes).
Figure 3.2 is a graphical representation of the results calculated in table 3.1.
Figure 7.3: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
Sensitivity (also known as recall) is the proportion of known positives (CAZymes) that are correctly identified as positives (CAZymes).
Figure 3.3 is a graphical representation of the results calculated in table 3.1.
Figure 7.4: One-dimensional scatter plot of recall (sensitivity) scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
Precision is the proportion of positive predictions by the classifiers that are correct.
In this case, precision represents the fraction of CAZyme predictions by the classifiers that are correct, specifically the proportion of predicted CAZymes that are known CAZymes.
Figure 3.4 is a visual representation of the results calculated in table 3.1.
Figure 7.5: One-dimensional scatter plot of precision scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.
The F1-score is the harmonic mean of recall and precision and summarises the overall performance of the tool, with 0 being the worst and 1 the best performance. Figure 3.5 shows the F1-score from each test set, for each classifier.
Figure 7.6: Bar chart of the F1-score of CAZyme classifiers' differentiation between CAZymes and non-CAZymes.
Accuracy, calculated as (TP + TN) / (TP + TN + FP + FN), summarises the overall performance of the classifiers as the degree to which their CAZyme/non-CAZyme predictions conform to the correct result. Figure 3.6 is a plot of the respective data from table 3.1.
Figure 7.7: Bar chart of the accuracy of CAZyme classifiers' differentiation between CAZymes and non-CAZymes.
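The five statistics defined above can all be derived from the four cells of a confusion matrix. The sketch below shows one way to do this for a single test set. The function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration; the notebook's own handling of empty denominators is not stated.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute binary classification statistics from confusion-matrix counts.

    tp/fn: known CAZymes predicted as CAZyme / non-CAZyme.
    tn/fp: known non-CAZymes predicted as non-CAZyme / CAZyme.
    """
    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    specificity = ratio(tn, tn + fp)
    sensitivity = ratio(tp, tp + fn)  # also known as recall
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    accuracy = ratio(tp + tn, tp + tn + fp + fn)
    return {
        "specificity": specificity,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical test set of 100 CAZymes and 100 non-CAZymes.
print(binary_metrics(tp=90, tn=98, fp=2, fn=10))
```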
Below is a combined (3x2) plot of the above plots for evaluating the binary CAZyme/non-CAZyme classification performance of dbCAN and the user-defined combinations of tools. In this case:
- dbCAN
- HMMER, DIAMOND, CUPP
- HMMER, DIAMOND, eCAMI
recombined_tools_class_df_pred
| Classifier | Spec Mean | Spec Standard Deviation | Spec Lower CI | Spec Upper CI | Sens Mean | Sens Standard Deviation | Sens Lower CI | Sens Upper CI | Prec Mean | Prec Standard Deviation | Prec Lower CI | Prec Upper CI | F1-score Mean | F1-score Standard Deviation | F1-score Lower CI | F1-score Upper CI | Acc Mean | Acc Standard Deviation | Acc Lower CI | Acc Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 0.9975 | 0.0097 | 0.9964 | 0.9985 | 0.7118 | 0.3888 | 0.6711 | 0.7526 | 0.7695 | 0.4098 | 0.7265 | 0.8124 | 0.7343 | 0.3937 | 0.6930 | 0.7756 | 0.9554 | 0.0635 | 0.9487 | 0.9620 |
| dbCAN | 0.9960 | 0.0126 | 0.9947 | 0.9973 | 0.9016 | 0.1705 | 0.8837 | 0.9194 | 0.9624 | 0.1294 | 0.9488 | 0.9760 | 0.9218 | 0.1454 | 0.9065 | 0.9370 | 0.9779 | 0.0417 | 0.9735 | 0.9823 |
| DIAMOND | 0.9956 | 0.0130 | 0.9942 | 0.9969 | 0.9078 | 0.1960 | 0.8872 | 0.9283 | 0.9578 | 0.1526 | 0.9418 | 0.9738 | 0.9213 | 0.1725 | 0.9032 | 0.9394 | 0.9816 | 0.0426 | 0.9771 | 0.9861 |
| eCAMI | 0.9852 | 0.0324 | 0.9818 | 0.9886 | 0.8362 | 0.2157 | 0.8137 | 0.8588 | 0.8966 | 0.2066 | 0.8749 | 0.9182 | 0.8487 | 0.1950 | 0.8282 | 0.8691 | 0.9590 | 0.0536 | 0.9534 | 0.9646 |
| HMMER | 0.9966 | 0.0103 | 0.9955 | 0.9977 | 0.8270 | 0.2407 | 0.8017 | 0.8522 | 0.9612 | 0.1388 | 0.9466 | 0.9757 | 0.8675 | 0.2013 | 0.8464 | 0.8886 | 0.9686 | 0.0392 | 0.9645 | 0.9727 |
| HMMER_DIAMOND_CUPP | 0.9979 | 0.0091 | 0.9969 | 0.9988 | 0.8234 | 0.2637 | 0.7958 | 0.8511 | 0.9711 | 0.1416 | 0.9563 | 0.9860 | 0.8648 | 0.2235 | 0.8414 | 0.8883 | 0.9723 | 0.0404 | 0.9681 | 0.9766 |
| HMMER_DIAMOND_eCAMI | 0.9963 | 0.0121 | 0.9950 | 0.9975 | 0.9020 | 0.1851 | 0.8826 | 0.9214 | 0.9621 | 0.1345 | 0.9480 | 0.9762 | 0.9208 | 0.1580 | 0.9043 | 0.9374 | 0.9799 | 0.0416 | 0.9755 | 0.9842 |
| Hotpep | 0.9749 | 0.0471 | 0.9700 | 0.9799 | 0.8317 | 0.2120 | 0.8095 | 0.8540 | 0.8576 | 0.2495 | 0.8314 | 0.8837 | 0.8207 | 0.2116 | 0.7985 | 0.8429 | 0.9421 | 0.0673 | 0.9350 | 0.9491 |
Below, a proportional area plot is generated representing the F-beta score for each CAZyme classifier for each test set. Each square is sized in proportion to the relative sample size. Not every CAZy class was present in every test set, so sample sizes differ between CAZy classes but are the same between classifiers.
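For reference, the F-beta score generalises the F1-score by weighting recall beta times as heavily as precision. The sketch below shows the standard formula. The function name `fbeta_score`, the default of beta = 1 and the zero-denominator convention are assumptions for this example; the value of beta used in the notebook is not stated in this section.

```python
def fbeta_score(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R).

    beta = 1 gives the F1-score; beta > 1 favours recall,
    beta < 1 favours precision.
    """
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when both inputs are zero
    return (1 + beta_sq) * precision * recall / denominator


# Hypothetical scores for one CAZy class in one test set.
print(fbeta_score(0.5, 1.0))          # F1
print(fbeta_score(0.5, 1.0, beta=2))  # recall weighted more heavily
```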
A dataframe of the number of test sets containing each CAZy class is generated.
## Prediction_tool GH GT PL CE AA CBM
## 1 dbCAN 70 70 38 67 37 70
## 2 HMMER 70 70 38 67 37 70
## 3 DIAMOND 70 70 38 67 37 70
## 4 Hotpep 70 70 38 67 37 70
## 5 CUPP 70 70 38 67 37 70
## 6 eCAMI 70 70 39 67 37 70
## 7 H_D_C 70 70 38 67 37 70
## 8 H_D_E 70 70 38 67 37 70
Figure 7.8: 95% confidence interval around the mean CAZy class classification per CAZy class
The sensitivity of each CAZyme classifier can be plotted against the specificity for each CAZy class. However, plotting all CAZy classes in a single plot produces an overly cramped plot unless very few test sets were used.
Below the prediction sensitivity is plotted against the specificity for each classifier, and a separate plot is generated for each CAZy class.
The scatter plots of sensitivity against specificity overlay a coloured contour to highlight the distribution of the points. When too many points have the same value, a contour cannot be generated, so noise is added to the data in order to plot the contour. The original data is used to plot the scatter plot, and the data with added noise is used to plot the contour.
The percentage of data points that need noise added in order to generate a contour varies from data set to data set. To change this percentage, change the third argument in the call to the function plot.class.sens.vs.spec(), which is used to generate the plots. The third argument is the proportion of data points to add noise to, written in decimal form.
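The idea of adding noise to only a fraction of the points can be illustrated with the sketch below. It is not the implementation of plot.class.sens.vs.spec(): the function name `add_noise`, the uniform noise model, the `scale` and `seed` parameters and the point layout (pairs of specificity and sensitivity) are all assumptions made for this example.

```python
import random


def add_noise(points, fraction, scale=0.01, seed=0):
    """Return a copy of `points` with noise added to a fraction of them.

    `points` is a list of (specificity, sensitivity) pairs and
    `fraction` is the proportion of points to perturb, in decimal form.
    The original list is left unchanged so it can still be used for
    the scatter plot, while the returned copy feeds the contour.
    """
    rng = random.Random(seed)
    n_noisy = round(len(points) * fraction)
    noisy_indices = set(rng.sample(range(len(points)), n_noisy))
    result = []
    for index, (x, y) in enumerate(points):
        if index in noisy_indices:
            x += rng.uniform(-scale, scale)
            y += rng.uniform(-scale, scale)
        result.append((x, y))
    return result


# Ten identical points, as happens when many test sets score 1.0.
original = [(1.0, 1.0)] * 10
for_contour = add_noise(original, fraction=0.3)
print(sum(1 for a, b in zip(original, for_contour) if a != b))  # points changed
```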
A single CAZyme can be included in multiple CAZy classes leading to the multilabel classification of CAZymes. To address this and evaluate the multilabel classification of CAZy classes the Rand Index (RI) and Adjusted Rand Index (ARI) were calculated.
The RI is a measure of accuracy across all potential classifications of a protein. The RI ranges from 0 (no correct annotations) to 1 (all annotations correct). The ARI is the RI adjusted for chance: a value of 0 is equivalent to assigning the CAZy class annotations randomly, -1 indicates that the annotations are systematically incorrect, and 1 indicates that all annotations are correct.
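As an illustration, the RI and ARI can be computed for a single protein by comparing its known and predicted annotations as two binary vectors, one position per CAZy class. The sketch below uses the standard pair-counting definitions. The function names, the vector layout and the convention of returning 1.0 when the ARI denominator is zero are assumptions for this example, and the notebook's own calculation may differ in detail.

```python
from collections import Counter
from math import comb


def _pair_counts(truth, pred):
    """Count pairs of positions grouped together in truth, pred and both."""
    total = comb(len(truth), 2)
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    return total, same_both, same_truth, same_pred


def rand_index(truth, pred):
    """Fraction of position pairs on which truth and pred agree (needs >= 2 positions)."""
    total, same_both, same_truth, same_pred = _pair_counts(truth, pred)
    diff_both = total - same_truth - same_pred + same_both
    return (same_both + diff_both) / total


def adjusted_rand_index(truth, pred):
    """Rand Index corrected for the agreement expected by chance."""
    total, same_both, same_truth, same_pred = _pair_counts(truth, pred)
    expected = same_truth * same_pred / total
    maximum = (same_truth + same_pred) / 2
    if maximum == expected:
        return 1.0  # assumed convention for identical trivial groupings
    return (same_both - expected) / (maximum - expected)


# Class order: GH, GT, PL, CE, AA, CBM (hypothetical protein).
known = [1, 0, 0, 0, 0, 1]      # GH and CBM
predicted = [1, 0, 0, 0, 0, 0]  # GH only
print(rand_index(known, predicted), adjusted_rand_index(known, predicted))
```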
Figure 7.9: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
| Prediction_tool | Mean | Standard Deviation | Lower CI | Upper CI |
|---|---|---|---|---|
| dbCAN | 0.9455 | 0.2254 | 0.9418 | 0.9492 |
| HMMER | 0.9268 | 0.2537 | 0.9226 | 0.9310 |
| DIAMOND | 0.9545 | 0.2079 | 0.9510 | 0.9579 |
| Hotpep | 0.8706 | 0.3212 | 0.8653 | 0.8759 |
| CUPP | 0.9007 | 0.2852 | 0.8960 | 0.9054 |
| eCAMI | 0.9060 | 0.2836 | 0.9013 | 0.9107 |
| HMMER_DIAMOND_CUPP | 0.9355 | 0.2392 | 0.9316 | 0.9395 |
| HMMER_DIAMOND_eCAMI | 0.9505 | 0.2155 | 0.9470 | 0.9541 |
Figure 7.10: 95% confidence interval around the mean of Adjusted Rand Index (ARI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
The plots below are violin plots underlying scatter plots, presenting the RI and ARI for every protein across all test sets.
Figure 7.11: Violin plot of Rand Index (RI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
Figure 7.12: Violin plot of Adjusted Rand Index (ARI) of performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
The following section evaluates the performance of combined CAZyme classifiers when predicting CAZy family classifications, comparing the user-defined combinations of classifiers against the individual classifiers.
Table 7.4 summarises the overall CAZy family classifications for each test set across all CAZy families (data frame recombined_tools_fam_df).
| Classifier | Spec Mean | Spec Standard Deviation | Spec Lower CI | Spec Upper CI | Sens Mean | Sens Standard Deviation | Sens Lower CI | Sens Upper CI | Prec Mean | Prec Standard Deviation | Prec Lower CI | Prec Upper CI | F1-score Mean | F1-score Standard Deviation | F1-score Lower CI | F1-score Upper CI | Acc Mean | Acc Standard Deviation | Acc Lower CI | Acc Upper CI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CUPP | 1.0000 | 0.0002 | 0.9999 | 1.0000 | 0.6582 | 0.4360 | 0.6084 | 0.7081 | 0.7048 | 0.4458 | 0.6538 | 0.7558 | 0.6723 | 0.4354 | 0.6225 | 0.7221 | 0.9992 | 0.0023 | 0.9989 | 0.9994 |
| dbCAN | 0.9999 | 0.0003 | 0.9999 | 1.0000 | 0.8874 | 0.2417 | 0.8597 | 0.9150 | 0.9309 | 0.2275 | 0.9048 | 0.9569 | 0.8997 | 0.2349 | 0.8728 | 0.9265 | 0.9995 | 0.0014 | 0.9994 | 0.9997 |
| DIAMOND | 0.9999 | 0.0003 | 0.9999 | 1.0000 | 0.8927 | 0.2386 | 0.8654 | 0.9200 | 0.9268 | 0.2257 | 0.9010 | 0.9527 | 0.9025 | 0.2323 | 0.8760 | 0.9291 | 0.9997 | 0.0008 | 0.9996 | 0.9997 |
| eCAMI | 0.9997 | 0.0009 | 0.9996 | 0.9998 | 0.7356 | 0.3412 | 0.6972 | 0.7739 | 0.7791 | 0.3671 | 0.7378 | 0.8203 | 0.7372 | 0.3437 | 0.6986 | 0.7758 | 0.9992 | 0.0016 | 0.9990 | 0.9994 |
| HMMER | 0.9999 | 0.0003 | 0.9999 | 0.9999 | 0.8703 | 0.2814 | 0.8383 | 0.9022 | 0.8861 | 0.2791 | 0.8545 | 0.9178 | 0.8640 | 0.2781 | 0.8325 | 0.8956 | 0.9994 | 0.0022 | 0.9991 | 0.9996 |
| HMMER_DIAMOND_CUPP | 1.0000 | 0.0002 | 0.9999 | 1.0000 | 0.8623 | 0.2864 | 0.8296 | 0.8951 | 0.9179 | 0.2602 | 0.8881 | 0.9477 | 0.8780 | 0.2761 | 0.8464 | 0.9096 | 0.9995 | 0.0019 | 0.9992 | 0.9997 |
| HMMER_DIAMOND_eCAMI | 0.9999 | 0.0003 | 0.9999 | 1.0000 | 0.8820 | 0.2480 | 0.8536 | 0.9103 | 0.9270 | 0.2372 | 0.8998 | 0.9541 | 0.8966 | 0.2416 | 0.8690 | 0.9243 | 0.9996 | 0.0010 | 0.9995 | 0.9997 |
| Hotpep | 0.9994 | 0.0020 | 0.9991 | 0.9996 | 0.7621 | 0.3347 | 0.7248 | 0.7993 | 0.7661 | 0.3771 | 0.7241 | 0.8081 | 0.7305 | 0.3504 | 0.6915 | 0.7695 | 0.9987 | 0.0034 | 0.9983 | 0.9991 |
To evaluate the overall performance of each classifier for each CAZy family, the F1-score was calculated for every family. Families were grouped by their parent CAZy class and the distribution of the F1-scores is shown in figure 5.1.
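Grouping families by their parent CAZy class can be done from the family name itself, since each family name starts with its class abbreviation (for example GH5 or CBM50). The sketch below illustrates this under stated assumptions: the function names `parent_class` and `group_f1_by_class`, the regular expression and the input dictionary of per-family F1-scores are hypothetical and are not the notebook's own code.

```python
import re
from collections import defaultdict

# The six CAZy class abbreviations used in this notebook.
_FAMILY_PATTERN = re.compile(r"^(GH|GT|PL|CE|AA|CBM)\d+")


def parent_class(family):
    """Return the CAZy class prefix of a family name, or None if unrecognised."""
    match = _FAMILY_PATTERN.match(family)
    return match.group(1) if match else None


def group_f1_by_class(f1_by_family):
    """Group per-family F1-scores into lists keyed by parent CAZy class."""
    grouped = defaultdict(list)
    for family, f1 in f1_by_family.items():
        cazy_class = parent_class(family)
        if cazy_class is not None:
            grouped[cazy_class].append(f1)
    return dict(grouped)


# Hypothetical F1-scores for one classifier.
scores = {"GH5": 0.9, "GH13": 0.8, "CBM1": 0.5}
print(group_f1_by_class(scores))
```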
Figure 7.13: Proportional area plot of the F1-score distribution per CAZy family, grouped by CAZy class.
Below is a table displaying the number of test sets in which each CAZy class was present; these counts were used to draw the proportional areas for each class in figure 5.1.
## Prediction_tool GH GT PL CE AA CBM
## 1 dbCAN 124 70 22 16 14 50
## 2 HMMER 126 72 22 16 14 51
## 3 DIAMOND 124 70 22 16 14 50
## 4 Hotpep 125 70 22 16 14 65
## 5 CUPP 124 70 22 16 14 50
## 6 eCAMI 124 70 22 16 14 61
## 7 H_D_C 0 0 0 0 0 0
## 8 H_D_E 0 0 0 0 0 0
To evaluate the performance of predicting each CAZy family independently of all other CAZy families, the sensitivity and precision for each CAZy family, for each CAZyme classifier, were calculated and plotted against each other (Fig.??). Whereas sensitivity was plotted directly against specificity for CAZy classes, for CAZy families the variation in specificity scores was extremely small, so sensitivity was plotted as a percentage against log10 of the specificity percentage.
Later on in this report the sensitivity for each CAZy family is plotted against specificity, as was done with CAZy class. However, owing to the extremely small differences in specificity, with no tool producing a specificity less than 0.995, it is extremely difficult to separate performance by specificity, so a box plot and scatter plot for each is plotted. Each point represents one test set, and test sets are grouped by CAZyme classifier and facet wrapped by the parent CAZy class.
Figure 7.14: 95% confidence interval around the mean of CAZy family classification.
Figure 7.15: 95% confidence interval around the mean CAZy family classifier per CAZy class
For better resolution, we can group the CAZy families by their parent CAZy classes and compare the performances of the tools one CAZy class at a time. Owing to the minimal variation in specificity scores, specificity was plotted as log10 of the percentage specificity.
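The transformation described above, log10 of the specificity expressed as a percentage, can be sketched as follows. The function name `log10_percent` is hypothetical and the input is assumed to be a score between 0 (exclusive) and 1.

```python
from math import log10


def log10_percent(score):
    """Convert a score in (0, 1] to log10 of its percentage value."""
    if score <= 0:
        raise ValueError("score must be greater than 0 to take a logarithm")
    return log10(score * 100)


# A specificity of 1.0 maps to 2; a specificity of 0.995 maps to about 1.9978.
for specificity in (1.0, 0.999, 0.995):
    print(specificity, round(log10_percent(specificity), 4))
```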
Figure 7.16 shows the plotting of sensitivity against specificity for each Glycoside Hydrolase CAZy family.
Figure 7.16: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycoside Hydrolases. Each GH CAZy family is represented as a single point on the plot.
Figure 7.17 shows the plotting of sensitivity against specificity for each Glycosyltransferases CAZy family.
Figure 7.17: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Glycosyltransferases. Each GT CAZy family is represented as a single point on the plot.
Figure 7.18 shows the plotting of sensitivity against specificity for each Polysaccharide Lyases CAZy family.
Figure 7.18: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Polysaccharide Lyases. Each PL CAZy family is represented as a single point on the plot.
Figure ?? shows the plotting of sensitivity against specificity for each Carbohydrate Esterases CAZy family.
Figure 7.19: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Esterases. Each CE CAZy family is represented as a single point on the plot.
Figure ?? shows the plotting of sensitivity against specificity for each Auxiliary Activities CAZy family.
Figure 7.20: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Auxiliary Activities. Each AA CAZy family is represented as a single point on the plot.
Figure 7.21 shows the plotting of sensitivity against specificity for each Carbohydrate Binding Module CAZy family.
Figure 7.21: Scatter plot of recall (sensitivity) against specificity for predicting each CAZy family for each CAZyme classifier in the CAZy class Carbohydrate Binding Modules. Each CBM CAZy family is represented as a single point on the plot.
| Prediction_tool | Mean | Standard Deviation |
|---|---|---|
| dbCAN | 0.9997 | 0.0011 |
| HMMER | 0.9996 | 0.0014 |
| DIAMOND | 0.9998 | 0.0010 |
| Hotpep | 0.9991 | 0.0023 |
| CUPP | 0.9995 | 0.0015 |
| eCAMI | 0.9994 | 0.0017 |
Figure 7.22: 95% confidence interval around the mean of Rand Index (RI) of the performance of the CAZyme classifiers to predict the multilabel classification of CAZy classes.
| Prediction_tool | Mean | Standard Deviation |
|---|---|---|
| dbCAN | 0.9391 | 0.2359 |
| HMMER | 0.9250 | 0.2554 |
| DIAMOND | 0.9530 | 0.2105 |
| Hotpep | 0.8758 | 0.3083 |
| CUPP | 0.9098 | 0.2712 |
| eCAMI | 0.9077 | 0.2778 |
Overall, all CAZyme classifiers showed strong performances at all three levels of CAZyme classification (CAZyme/non-CAZyme, CAZy class and CAZy family).
Performance differed most between the CAZyme classifiers in sensitivity.
In general, the CAZyme/non-CAZyme, CAZy class and CAZy family classifications were accurate for all CAZyme classifiers (i.e. when a classification was predicted, it was frequently correct). However, the CAZyme classifiers do not predict a comprehensive CAZome: the variation in sensitivity indicates a non-comprehensive annotation of the CAZome, CAZy class members and CAZy family members.
Whether the proteins were bacterial or eukaryotic had a negligible impact on the performance of the CAZyme classifiers at every level of classification (CAZyme/non-CAZyme, CAZy class and CAZy family).